Lec17 ReinforcementLearning

The document provides an introduction to Reinforcement Learning (RL), detailing its principles, including the interaction of agents with environments to maximize rewards through learned policies. It explains the Markov Decision Process (MDP) framework, the importance of value functions, and the distinction between value-based and policy-based methods. Additionally, it discusses Q-learning and Deep Q-learning, highlighting their algorithms and challenges in training.

INTRODUCTION TO MACHINE LEARNING

Reinforcement Learning

Giovanni Iacca

(credits: Elisa Ricci)


Reinforcement
Learning
IDEA
Reinforcement learning
● We discussed supervised and unsupervised learning
● Today we will see Reinforcement Learning (RL)
Reinforcement learning
● Inspired by research in psychology and animal learning
● Problems involving an agent interacting with an environment, which
provides numeric reward signals
● Goal: Learn how to take actions to maximize a reward
Reinforcement learning
● Agent can take actions that affect
the state of the environment and
observe occasional rewards that
depend on the state
● A policy is a mapping from states to
actions
● Goal: Learn a policy to maximize expected reward over time

[Agent-environment loop figure: from state st, the agent takes action at; the environment returns reward rt and next state st+1]
Example – Atari Games
● Objective: Complete the game with the highest score
● State: Raw pixel inputs of the game state
● Action: Game controls, e.g., Left, Right, Up, Down
● Reward: Score increase/decrease at each time step

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, et al., Human-level control through deep reinforcement learning, Nature 2015
Example – Go
● Objective: Win the game
● State: Position of all pieces
● Action: Where to put the next piece down
● Reward: 1 if win, 0 otherwise (“delayed” reward at the end of a game)

https://deepmind.com/research/alphago/
Example – ChatGPT
Markov
Decision
Process
Markov Decision Process
● A Markov Decision Process (MDP) is a framework used to make decisions in a stochastic environment.
● Our goal is to find a policy, i.e., a map that gives the optimal action for each state of the environment.
● To solve MDPs, we use Dynamic Programming (DP), more specifically the Bellman equation.
● DP is a method that divides a problem into simpler sub-problems that are easier to solve.
Markov Decision Process
● Components:
○ States s, beginning with initial state s0
○ Actions a
○ Transition model P(s' | s , a)
■ Markov assumption: the probability of going to s' from s depends
only on s and a and not on any other past actions/states
○ Reward function r(s)
● Policy 𝛑(s): the action that an agent takes in any given state
Markov Decision Process
● An MDP is defined by:
(𝓢, 𝓐, 𝓡, ℙ, 𝛾)
𝓢 : set of possible states
𝓐 : set of possible actions
𝓡 : distribution of reward given (state, action) pair
ℙ : transition probability, i.e., distribution over the next state, given
(state, action) pair
𝛾 : discount factor
Example – Grid World
Objective: reach the diamond terminal state in the least number of actions

Reward: a scalar value that you get for being in a state
Example – Grid World
Transition model: actions are stochastic; the probabilities over the possible resulting moves are 0.1, 0.8, and 0.1
Example – Grid World
Goal: find the optimal policy

A policy is a map that tells the agent which action to take in every state.
The optimal policy is the policy that maximizes the expected reward.
Example – Grid World
The optimal policy depends on the reward function
MDP Loop
● At time step t=0, environment samples initial state s0 ~ p(s0)
● Repeat:
○ Agent selects action at
○ Environment samples reward rt ~ R( . | st , at )
○ Environment samples next state st +1 ~ P( . | st , at )
○ Agent receives reward rt and next state st +1

A policy 𝛑(s) is a function from 𝓢 to 𝓐 that specifies what action to take in each state.
Objective: find policy 𝛑* that maximizes cumulative discounted reward.
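To make this loop concrete, here is a minimal Python sketch; the `env` object with `reset()`/`step()` methods and the `policy` function are illustrative assumptions (a Gym-like interface), not part of the slides.

```python
def run_episode(env, policy, gamma=0.9, max_steps=100):
    """Roll out one episode following `policy`; return the cumulative discounted reward."""
    state = env.reset()                         # environment samples initial state s0
    total_reward, discount = 0.0, 1.0
    for t in range(max_steps):
        action = policy(state)                  # agent selects action a_t = pi(s_t)
        state, reward, done = env.step(action)  # env samples r_t and next state s_{t+1}
        total_reward += discount * reward       # accumulate gamma^t * r_t
        discount *= gamma
        if done:                                # terminal state reached
            break
    return total_reward
```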
Cumulative Discounted Reward
● Suppose that following policy 𝛑, starting in state s0, leads to a sequence s0, s1, s2, ...
● The cumulative reward of the sequence is:
  r(s0) + r(s1) + r(s2) + ...
● State sequences can vary in length or even be infinite
● Typically, we define the cumulative reward as the sum of rewards discounted by a factor 𝛄 (0 < 𝛄 < 1):
  r(s0) + 𝛄 r(s1) + 𝛄² r(s2) + ... = Σt≥0 𝛄^t r(st)
Cumulative Discounted Reward
● The discount factor controls the importance of future rewards versus immediate ones.
● The lower the discount factor, the less important future rewards are, and the agent will tend to focus on actions that yield immediate rewards only.
● The cumulative reward is bounded by rmax / (1 − 𝛄), where rmax is the largest possible reward.
● This helps the algorithm to converge.
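As a quick sanity check of this bound, the snippet below compares a long discounted sum of constant rewards with rmax / (1 − 𝛄); the numbers are illustrative only.

```python
gamma, r_max = 0.9, 1.0

# Discounted sum of a constant reward over a long horizon...
discounted_sum = sum(gamma ** t * r_max for t in range(1000))

# ...stays below the geometric-series bound r_max / (1 - gamma).
bound = r_max / (1 - gamma)
print(round(discounted_sum, 4), bound)   # ~10.0 vs 10.0
```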
RL vs. Supervised Learning
● Supervised Learning loop
○ Get input xi sampled i.i.d. from data distribution
○ Use model with parameters w to predict output y
○ Observe target output yi and loss l(w, xi , yi)
○ Update w to reduce the loss with SGD: w ← w − η ∇w l(w, xi, yi)
RL vs. Supervised Learning
● Reinforcement Learning loop
○ From state s, take action a determined by policy 𝛑(s)
○ Environment selects next state s' based on transition model P(s' | s, a)
○ Observe s' and reward r(s), update policy
RL vs. Supervised Learning
● Supervised Learning
○ Next input does not depend on previous inputs or agent predictions
○ There is a supervision signal at every step
○ Loss is differentiable w.r.t. model parameters

● Reinforcement Learning
○ Agent’s actions affect the environment and help to determine next observation
○ Rewards may be sparse (i.e., not every state may have a reward)
○ Rewards are usually not differentiable w.r.t. model parameters
RL Methods
Two main approaches for RL
● Value-based methods
○ The goal of the agent is to optimize the value function V(s).
○ The value of each state is the total reward an RL agent can expect to collect over
the future from a given state.
● Policy-based approach:
○ We define a policy which we need to optimize directly.
○ The policy defines how the agent behaves.
○ Stochastic policies give a probability distribution over the different actions: 𝛑(a | s)
Value-based
methods
Value Function
● The value function gives the total reward the agent can expect from a particular state, considering all possible states reachable from that state. With the value function, you can find a policy.
● The value function V of a state s w.r.t. policy 𝛑 is the expected cumulative reward obtained by following that policy starting in s:
  V𝛑(s) = E[ Σt≥0 𝛄^t r(st) | s0 = s, 𝛑 ]
● The optimal value of a state is the value achievable by following the best possible policy:
  V*(s) = max𝛑 V𝛑(s)
● Essentially, the value function tells "how good" a state is.
Q-value Function
● It is often more convenient to define the value of a (state, action) pair:
  Q𝛑(s, a) = E[ Σt≥0 𝛄^t r(st) | s0 = s, a0 = a, 𝛑 ]
● The optimal Q-value function tells "how good" a (state, action) pair is:
  Q*(s, a) = max𝛑 Q𝛑(s, a)
● When the optimal Q-value function is found, it is used to compute the optimal policy:
  𝛑*(s) = argmaxa Q*(s, a)
Q-value Function

A table where we store the maximum expected future reward for each action at each state.
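With the Q-table stored as a 2-D array indexed by (state, action), extracting the greedy policy of the previous slide is a row-wise argmax; the 6×6 shape below is just an illustrative placeholder.

```python
import numpy as np

n_states, n_actions = 6, 6
Q = np.zeros((n_states, n_actions))   # Q[s, a]: maximum expected future reward

def greedy_policy(Q, s):
    """pi*(s) = argmax_a Q*(s, a)."""
    return int(np.argmax(Q[s]))
```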
Bellman Equation
● Recursive relationship between the optimal values of successive states and actions:
  Q*(s, a) = E_s' [ r + 𝛄 maxa' Q*(s', a') | s, a ]
● If the optimal state-action values for the next time step, Q*(s', a'), are known, then the optimal strategy is to take the action that maximizes the expected value.
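For a small MDP whose transition model and rewards are known, the Bellman equation can be applied repeatedly as a fixed-point update (Q-value iteration). The two-state model below is invented purely for illustration, with the reward written as r(s, a) for simplicity.

```python
import numpy as np

# Toy model: 2 states, 2 actions. P[s, a, s'] are transition probabilities,
# r[s, a] are rewards; all numbers are made up for the example.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(200):
    # Bellman backup: Q(s,a) <- r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
    Q = r + gamma * (P * Q.max(axis=1)).sum(axis=2)

print(np.round(Q, 2))   # converged optimal Q-values of the toy MDP
```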
Q-learning
A robot needs to reach room 5.

https://leonardoaraujosantos.gitbook.io/artificial-inteligence/artificial_intelligence/reinforcement_learning/qlearning_simple
Q-learning
● Components
○ Actions: {0,1,2,3,4,5}
○ States: {0,1,2,3,4,5}
○ Rewards: {0,100}
● Goal state: 5

NOTE: in this specific example, the action and state spaces are the same; however, this is not always the case!
Q-learning
● Reward Table (the value -1 indicates that a specific action is not available in that state)
Q-learning
● The whole point of Q-learning is that the matrix R is available only to the environment; the agent needs to learn R by itself through experience.
● What the agent has is a Q matrix that encodes the (state, action) → reward mapping; it is initialized with zeros and, through experience, comes to resemble the matrix R.
● The policy can be obtained from the Q matrix.
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

𝛾=0.8 s=1
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

As we start from state s=1 (second row) there are only


𝛾=0.8 s=1 the actions 3 (reward 0) or 5 (reward 100) to be done,
imagine that we choose randomly the action 5.
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

On state 5, there are 3 possible actions. We're just Episode 1


𝛾=0.8 s=1 interested on the action with biggest reward. But,
at this point the Q table is still filled with zeros!
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

As the new state is 5 and this state is the goal Episode 1


𝛾=0.8 s=1 state, we finish our episode. Now at the end of this
episode the we update the Q table.
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

𝛾=0.8 s=1
Algorithm
● Initialize the Q matrix with zeros
● Select a random initial state
● For each episode (i.e., a sequence of actions that starts in the initial state and ends in the goal state, or when another stop criterion is met, e.g., a maximum number of timesteps):
● While state is not the goal state (or stop criteria are not met)
○ Select a random possible action for the current state
○ Using this possible action, consider going to the next state
○ Get maximum Q value for the next state (on all possible actions on the next state)
○ Q*(s, a)=R(s, a)+𝛾 maxa' [Q*(s', a')]

Typically, all the non-zero


elements are divided by
they greatest value to
normalize the Q table.

After many episodes


𝛾=0.8 s=1
Find the Optimal Policy
1. Set current state = initial state.
2. From the current state, find the action with the highest Q-value.
3. Set current state = next state (the state reached by the action chosen in step 2).
4. Repeat steps 2 and 3 until current state = goal state.
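Putting the whole procedure together, here is a minimal sketch of tabular Q-learning for the room-navigation example. The 6×6 reward matrix is the one used in the linked tutorial (treat it as an assumption), with −1 marking unavailable actions; taking action a simply moves the agent to room a.

```python
import numpy as np

# Reward matrix R[s, a]; -1 means action a is not available in state s.
R = np.array([
    [-1, -1, -1, -1,  0, -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1, -1],
    [-1,  0,  0, -1,  0, -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
], dtype=float)

gamma, goal, n_states = 0.8, 5, 6
Q = np.zeros_like(R)
rng = np.random.default_rng(0)

for episode in range(500):
    s = int(rng.integers(n_states))                      # random initial state
    while s != goal:
        a = int(rng.choice(np.flatnonzero(R[s] >= 0)))   # random available action
        s_next = a                                       # taking action a moves to room a
        # Q-learning update: Q(s,a) = R(s,a) + gamma * max_a' Q(s',a')
        Q[s, a] = R[s, a] + gamma * Q[s_next].max()
        s = s_next

print(np.round(100 * Q / Q.max()))                       # normalized Q-table
```

Following the greedy policy (the per-row argmax of the normalized table) from any room should then lead to room 5.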
Deep Q-learning

● The Bellman equation is a constraint on the Q-values of successive states:
  Q*(s, a) = E_s' [ r + 𝛄 maxa' Q*(s', a') | s, a ]
● Problem: the state spaces of interesting problems are huge (e.g., Atari games)
● Solution: approximate the Q-values using a parametric function (w being the parameters):
  Q(s, a; w) ≈ Q*(s, a)
Deep Q-learning

● Train a deep network with parameters w that approximates Q*:

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Human-level control through deep
reinforcement learning, Nature 2015
Deep Q-learning

● Idea: at each iteration i of training, update the model parameters w to push Q close to the target yi

● Loss function (that changes at each iteration):
  Li(wi) = E_{s,a∼ρ(·)} [ (yi − Q(s, a; wi))² ],  with target  yi = E_s' [ r + 𝛄 maxa' Q(s', a'; wi−1) | s, a ]

where ρ is a probability distribution over states s and actions a that we refer to as the behaviour distribution.

https://shivam5.github.io/drl/
Deep Q-learning
● Target:  yi = E_s' [ r + 𝛄 maxa' Q(s', a'; wi−1) | s, a ]
● Loss:  Li(wi) = E_{s,a∼ρ(·)} [ (yi − Q(s, a; wi))² ]
● Gradient update:  ∇wi Li(wi) = E_{s,a∼ρ(·), s'} [ (yi − Q(s, a; wi)) ∇wi Q(s, a; wi) ]

● SGD training: replace the expectations by sampling experiences (s, a, s') using the behaviour distribution and the transition model

https://shivam5.github.io/drl/
Deep Q-learning
● Training is prone to instability
○ Unlike in supervised learning, the targets are “moving”
○ Successive experiences are correlated and depend on the policy
○ Policy may change rapidly with slight changes to parameters, leading to
drastic changes in data distribution
● Solutions
○ “Freeze” target Q-network
○ Use an experience replay buffer to store past experience and sample minibatches from it
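A minimal PyTorch-style sketch of these two fixes (a periodically synchronized target network and an experience replay buffer). The network sizes, buffer size, and hyperparameters are illustrative assumptions, not the exact setup of the Nature paper.

```python
import random
from collections import deque
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(s, .; w)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # "frozen" copy
target_net.load_state_dict(q_net.state_dict())

replay = deque(maxlen=10_000)   # stores (s, a, r, s_next, done) tensors; a is a LongTensor
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    s, a, r, s_next, done = map(torch.stack, zip(*random.sample(replay, batch_size)))
    with torch.no_grad():                                # target uses the frozen network
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; w) for the taken actions
    loss = nn.functional.mse_loss(q, y)                  # (y_i - Q(s, a; w_i))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every K training steps: target_net.load_state_dict(q_net.state_dict())
```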
Deep Q-learning in Atari

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Human-level control through deep reinforcement
learning, Nature 2015
Deep Q-learning in Atari

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Human-level control through deep reinforcement
learning, Nature 2015
Deep Q-learning in Atari

https://www.youtube.com/watch?v=V1eYniJ0Rnk
Policy
Gradient
methods
Policy gradient methods

● Instead of indirectly representing the policy using Q-values, it can be more efficient to parameterize 𝛑 and learn it directly
● Especially in large or continuous action spaces, the Q-value function can be very complicated
● Example: a robot grasping an object has a very high-dimensional state, and it is hard to learn the exact value of every (state, action) pair
Stochastic Policy Representation
Instead, learn a function giving the probability distribution over actions given the current state:
  𝛑θ(a | s)
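For example, a minimal sketch of such a stochastic policy as a small softmax network over a discrete action set; the input and output sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# pi_theta(a | s): the network outputs the logits of a categorical distribution over actions.
policy_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def act(state):
    """Sample an action a ~ pi_theta(a | s)."""
    dist = torch.distributions.Categorical(logits=policy_net(state))
    return dist.sample()
```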
Policy gradient methods

Policy gradient for the Pong game:

The basic idea is to use a Machine Learning model that will learn a good policy from playing the game and receiving rewards.
Objective function
Find the best parameters θ (the parameters of the policy) that maximize the expected reward (via gradient ascent, i.e., gradient descent on −J(θ)):
  J(θ) = E_{τ∼𝛑θ} [ r(τ) ],   θ* = argmaxθ J(θ)
Optimization

∇θ J(θ) = E_{τ∼𝛑θ} [ ∇θ log 𝛑θ(τ) · r(τ) ] = E_{τ∼𝛑θ} [ Σt ∇θ log 𝛑θ(at | st) · r(τ) ]

We don't know the transition probability.

NOTE: We do not need to know the environment dynamics p: the log-probability of a trajectory splits into policy terms and transition terms, and the transition terms do not depend on θ, so they vanish from the gradient.
Optimization

Stochastic approximation: sample N trajectories τ1, ..., τN and estimate
  ∇θ J(θ) ≈ (1/N) Σi Σt ∇θ log 𝛑θ(at^i | st^i) · r(τi)


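A minimal sketch of this Monte-Carlo estimate (the REINFORCE update) for a small softmax policy; the trajectory format and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # logits of pi_theta(a|s)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

def reinforce_update(trajectories):
    """trajectories: list of (states, actions, rewards) tensors for N sampled episodes."""
    loss = torch.tensor(0.0)
    for states, actions, rewards in trajectories:
        log_probs = torch.distributions.Categorical(logits=policy_net(states)).log_prob(actions)
        ret = rewards.sum()                    # r(tau): total reward of the trajectory
        # Gradient ascent on sum_t log pi(a_t|s_t) * r(tau), via minimizing its negative
        loss = loss - log_probs.sum() * ret
    loss = loss / len(trajectories)            # average over the N sampled trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```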
Reinforcement Loop

R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 8(3):229-256, 1992
Intuition
If going up the hill of the objective function means higher rewards, we change the model parameters, and thus the policy, to increase the likelihood of trajectories that climb higher during the optimization process.
QUESTIONS?
