Reinforcement Learning
Karan Kathpalia
Overview
• Introduction to Reinforcement Learning
• Finite Markov Decision Processes
• Temporal-Difference Learning (SARSA, Q-learning, Deep Q-Networks)
• Policy Gradient Methods (Finite Difference Policy Gradient, REINFORCE, Actor-Critic)
• Asynchronous Reinforcement Learning
Introduction to Reinforcement Learning
Chapter 1 – Reinforcement Learning: An Introduction
Imitation Learning Lecture Slides from CMU Deep Reinforcement Learning Course
What is Reinforcement Learning?
• Learning from interaction with an environment to achieve some long-term goal that is related to the state of the environment
• The goal is defined by a reward signal, which must be maximised
• The agent must be able to partially/fully sense the environment state and take actions to influence that state
• The state is typically described with a feature vector
Exploration versus Exploitation
• We want a reinforcement learning agent to earn lots of reward
• The agent must prefer past actions that have been found to be effective at producing reward
• The agent must exploit what it already knows to obtain reward
• The agent must also select untested actions to discover reward-producing actions
• The agent must explore actions to make better action selections in the future
• Trade-off between exploration and exploitation (a minimal sketch follows this list)
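A minimal sketch of the trade-off in Python, assuming a table Q of estimated action values keyed by (state, action); the table, the action list and the value of epsilon are illustrative, not from the slides:

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # Explore with probability epsilon: pick an untested/random action
        if random.random() < epsilon:
            return random.choice(actions)
        # Exploit otherwise: pick the action with the highest estimated value
        return max(actions, key=lambda a: Q[(state, a)])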
Reinforcement Learning Systems
• Reinforcement learning systems have 4 main
elements:
– Policy
– Reward signal
– Value function
– Optional model of the environment
Policy
• A policy is a mapping from the perceived states of the environment to actions to be taken when in those states
• A reinforcement learning agent uses a policy to select actions given the current environment state (a minimal sketch follows)
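In the simplest (tabular, deterministic) case, a policy is just a lookup from state to action; a minimal sketch, with illustrative state and action names that are not from the slides:

    # A deterministic tabular policy: perceived state -> action to take in that state
    policy = {
        "start": "move_right",
        "corridor": "move_right",
        "junction": "move_up",
    }

    def select_action(policy, state):
        # The agent acts according to its current policy
        return policy[state]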
Reward Signal
• The reward signal defines the goal
• On each time step, the environment sends a single number called the reward to the reinforcement learning agent
• The agent’s objective is to maximise the total reward that it receives over the long run
• The reward signal is used to alter the policy
Value Function (1)
• The reward signal indicates what is good in the short run, while the value function indicates what is good in the long run
• The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting in that state
• The value is computed using the states that are likely to follow the current state and the rewards available in those states
• Future rewards may be time-discounted with a factor in the interval [0, 1] (a minimal sketch follows this list)
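A minimal sketch of a time-discounted return in Python; the reward sequence and the discount factor gamma are illustrative:

    def discounted_return(rewards, gamma=0.9):
        # Total discounted reward: r1 + gamma*r2 + gamma^2*r3 + ...
        g = 0.0
        for k, r in enumerate(rewards):
            g += (gamma ** k) * r
        return g

    # Example: discounted_return([0, 0, 1], gamma=0.9) == 0.81
    # A reward received later contributes less to the value of the current state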
Value Function (2)
• Use the values to make and evaluate decisions
• Action choices are made based on value judgements
• Prefer actions that bring about states of highest value instead of highest reward (see the sketch after this list)
• Rewards are given directly by the environment
• Values must continually be re-estimated from the sequence of observations that an agent makes over its lifetime
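A minimal sketch of choosing actions by value rather than by immediate reward, assuming for illustration that the agent can predict each action's successor state; the V table and next_state function are hypothetical:

    def greedy_by_value(V, state, actions, next_state):
        # Prefer the action whose successor state has the highest estimated value,
        # not the action with the highest immediate reward
        return max(actions, key=lambda a: V[next_state(state, a)])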
Model-free versus Model-based
• A model of the environment allows inferences to be made about how the environment will behave
• Example: given a state and an action to be taken while in that state, the model could predict the next state and the next reward
• Models are used for planning, which means deciding on a course of action by considering possible future situations before they are experienced
• Model-based methods use models and planning; think of this as modelling the dynamics p(s’ | s, a)
• Model-free methods learn exclusively from trial and error (i.e. no modelling of the environment); a minimal sketch of the distinction follows this list
• This presentation focuses on model-free methods
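A minimal sketch of the distinction in Python, assuming a tabular environment; the dictionaries and function names are illustrative:

    # Model-based: maintain an estimate of the dynamics and plan with it
    model = {}  # (state, action) -> (predicted next state, predicted reward)

    def simulate(state, action):
        # Planning: imagine the outcome of an action without executing it
        return model[(state, action)]

    # Model-free: no model of the environment, only value estimates from trial and error
    Q = {}  # (state, action) -> estimated long-run reward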
On-policy versus Off-policy
• An on-policy agent learns only about the policy that it is executing
• An off-policy agent learns about a policy or policies different from the one that it is executing (a sketch contrasting the two follows)
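A minimal sketch of the difference using the two tabular TD updates covered later in this deck: SARSA (on-policy) bootstraps from the action the executing policy actually took, while Q-learning (off-policy) bootstraps from the greedy action regardless of what the agent executes; the alpha and gamma defaults are illustrative:

    def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
        # On-policy: the target uses the action a2 actually taken in s2 by the executing policy
        Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

    def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
        # Off-policy: the target uses the best action in s2, not necessarily the one executed
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])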
Credit Assignment Problem
• Given a sequence of states and actions, and the final sum of time-discounted future rewards, how do we infer which actions were effective at producing lots of reward and which actions were not effective?
• How do we assign credit for the observed rewards given a sequence of actions over time?
• Every reinforcement learning algorithm must address this problem
Reward Design
• We need rewards to guide the agent to achieve its goal
• Option 1: Hand-designed reward functions
• This is a black art
• Option 2: Learn rewards from demonstrations
• Instead of having a human expert tune a system to achieve the desired behaviour, the expert can demonstrate the desired behaviour and the robot can tune itself to match the demonstration
What is Deep Reinforcement Learning?
• Deep reinforcement learning is standard reinforcement learning where a deep neural network is used to approximate either a policy or a value function (a minimal sketch follows this list)
• Deep neural networks require lots of real/simulated interaction with the environment to learn
• Lots of trials/interactions are possible in simulated environments
• We can easily parallelise the trials/interaction in simulated environments
• We cannot easily do this with real robots (where there is no simulation) because action execution takes time, accidents/failures are expensive and there are safety concerns
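A minimal sketch of "a deep neural network approximates a value function", assuming PyTorch and an environment with a 4-dimensional state and 2 discrete actions; the sizes and layer widths are illustrative:

    import torch
    import torch.nn as nn

    # A small network mapping a state vector to one Q-value per action
    q_network = nn.Sequential(
        nn.Linear(4, 64),   # state features -> hidden layer
        nn.ReLU(),
        nn.Linear(64, 2),   # hidden layer -> Q(s, a) for each of the 2 actions
    )

    state = torch.randn(1, 4)        # an example (batched) state
    q_values = q_network(state)      # estimated value of each action in this state
    greedy_action = q_values.argmax(dim=1)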
Finite Markov Decision Processes