
Introduction to Reinforcement Learning
Chapter 1 – Reinforcement Learning: An Introduction
Imitation Learning Lecture Slides from the CMU Deep Reinforcement Learning Course
What is learning?
Learning takes place as a result of interaction between an agent and the world. The idea behind learning is that the percepts received by an agent should be used not only for acting, but also for improving the agent's ability to behave optimally in the future and achieve its goal.
Learning types
Supervised learning:
a situation in which sample (input, output) pairs of the function to be learned can be perceived or are given
 You can think of this as having a kind teacher
Reinforcement learning:
the agent acts on its environment and receives some evaluation of its action (a reinforcement), but is not told which action is the correct one for achieving its goal
What is Reinforcement Learning?
Learning from interaction with an environment to achieve some long-term goal that is related to the state of the environment
The goal is defined by a reward signal, which must be maximised
The agent must be able to partially or fully sense the environment state and take actions that influence that state
The state is typically described by a feature vector
RL is learning from interaction
RL model
Each percept (e) is enough to determine the state (the state is accessible)
The agent can decompose the reward component from a percept
The agent's task is to find an optimal policy, mapping states to actions, that maximises a long-run measure of the reinforcement
Think of reinforcement as reward
This can be modelled as an MDP!
 The Markov decision process (MDP) is a mathematical framework for modeling decision-making problems where the outcomes are partly random and partly controllable
 It is a framework that can address most reinforcement learning problems
Review of MDP model
MDP model <S, A, T, R>
• S – set of states
• A – set of actions
• T(s, a, s') = P(s' | s, a) – the probability of transitioning from s to s' given action a
• R(s, a) – the expected reward for taking action a in state s

R(s, a) = Σ_{s'} P(s' | s, a) r(s, a, s') = Σ_{s'} T(s, a, s') r(s, a, s')

[Figure: the agent-environment interaction loop (the agent receives state and reward from the environment and sends back an action), together with a sample trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, s3, ...]
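As a concrete illustration (not part of the original slides), here is a minimal Python sketch of how the tabular quantities above could be represented; the state and action names are made up for the example.

# Minimal sketch of a tabular MDP <S, A, T, R>; states and actions are hypothetical
S = ["s0", "s1"]
A = ["left", "right"]

# T[(s, a)] maps each next state s' to P(s' | s, a)
T = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s1": 1.0},
}

# r[(s, a, s')] is the immediate reward for that transition (unlisted transitions give 0)
r = {("s0", "right", "s1"): 1.0}

def expected_reward(s, a):
    """R(s, a) = sum over s' of T(s, a, s') * r(s, a, s')."""
    return sum(p * r.get((s, a, s_next), 0.0)
               for s_next, p in T[(s, a)].items())

print(expected_reward("s0", "right"))  # 0.8 * 1.0 = 0.8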
Exploration versus Exploitation
We want a reinforcement learning agent to earn lots of reward
To do so it must prefer actions that it has found in the past to be effective at producing reward: it must exploit what it already knows to obtain reward
But it must also select untested actions in order to discover reward-producing actions: it must explore to make better action selections in the future
There is a trade-off between exploration and exploitation; a simple ε-greedy rule, sketched below, is one common way to balance the two
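A minimal sketch of such an ε-greedy rule, assuming tabular action-value estimates Q[state][action]; the names and the value of ε are illustrative, not from the slides.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

# Example action-value estimates for a single state
Q = {"s0": {"left": 0.2, "right": 0.7}}
print(epsilon_greedy(Q, "s0", ["left", "right"]))  # usually "right", occasionally a random action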
Passive learning vs. Active learning
Passive learning
The agent simply watches the world go by and tries to learn the utilities of being in various states
Active learning
The agent does not simply watch, but also acts
Passive learning scenario
The agent sees sequences of state transitions and the associated rewards
The environment generates the state transitions and the agent perceives them
Key idea: update the utility value of each state using the given training sequences (a sketch follows below)
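One simple way to do this (an illustrative sketch, not taken from the slides) is to keep a running average of the discounted returns observed from each state:

from collections import defaultdict

def update_utilities(U, counts, episode, gamma=0.9):
    """Update state utilities from one training sequence.

    episode: list of (state, reward) pairs seen by the passive agent.
    U[s] is kept as the running average of the discounted returns
    observed from state s onward (a Monte Carlo / LMS-style estimate).
    """
    G = 0.0
    for state, reward in reversed(episode):   # accumulate the return backwards
        G = reward + gamma * G
        counts[state] += 1
        U[state] += (G - U[state]) / counts[state]
    return U

U, counts = defaultdict(float), defaultdict(int)
update_utilities(U, counts, [("s0", 0.0), ("s1", 1.0)])
print(dict(U))   # {'s1': 1.0, 's0': 0.9}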
Reinforcement Learning
Systems
Reinforcement learning systems have 4
main elements:
Policy
Reward signal
Value function
Optional model of the environment
Model-based vs. Model-free approaches
 But often we don't know anything about the environment model, i.e. the transition function T(s, a, s')
 This leads to two approaches:
 Model-based RL: learn the model, and use it to derive the optimal policy, e.g. the adaptive dynamic programming (ADP) approach
 Model-free RL: derive the optimal policy without learning the model, e.g. LMS and temporal-difference approaches (a minimal TD(0) update is sketched below)
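A minimal sketch of a temporal-difference update for state utilities (a TD(0) rule; the learning rate, discount factor, and variable names are illustrative, not from the slides):

def td0_update(U, s, reward, s_next, alpha=0.1, gamma=0.9):
    """TD(0): move U[s] toward the bootstrapped target r + gamma * U[s'].
    No model of T(s, a, s') is needed, only the observed transition."""
    target = reward + gamma * U.get(s_next, 0.0)
    U[s] = U.get(s, 0.0) + alpha * (target - U.get(s, 0.0))
    return U

U = {}
td0_update(U, "s0", reward=1.0, s_next="s1")
print(U)   # {'s0': 0.1}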
Model-free versus Model-
based
 A model of the environment allows inferences to be
made about how the environment will behave
 Example: Given a state and an action to be taken
while in that state, the model could predict the next
state and the next reward
 Models are used for planning, which means deciding
on a course of action by considering possible future
situations before they are experienced
 Model-based methods use models and planning. Think
of this as modelling the dynamics p(s’ | s, a)
 Model-free methods learn exclusively from trial-and-
error (i.e. no modelling of the environment)
 This presentation focuses on model-free methods
Policy
A policy is a mapping from the perceived states of the environment to the actions to be taken when in those states
A reinforcement learning agent uses a policy to select actions given the current environment state (a toy example follows below)
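As a toy illustration (not from the slides), a deterministic tabular policy can literally be stored as a state-to-action mapping:

# A deterministic policy as a plain state -> action mapping (toy example)
policy = {"s0": "right", "s1": "left"}

def select_action(policy, state):
    return policy[state]

print(select_action(policy, "s0"))   # "right"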
On-policy versus Off-policy
An on-policy agent learns only about the
policy that it is executing
An off-policy agent learns about a policy or
policies different from the one that it is
executing
Credit Assignment Problem
Given a sequence of states and actions,
and the final sum of time-discounted future
rewards, how do we infer which actions
were effective at producing lots of reward
and which actions were not effective?
How do we assign credit for the observed
rewards given a sequence of actions over
time?
Every reinforcement learning algorithm
must address this problem
Reward Design
We need rewards to guide the agent to
achieve its goal
Option 1: Hand-designed reward functions
Option 2: Learn rewards from
demonstrations
Instead of having a human expert tune a
system to achieve the desired behaviour, the
expert can demonstrate desired behaviour
and the robot can tune itself to match the
demonstration
Reward Signal
The reward signal defines the goal
On each time step, the environment sends
a single number called the reward to the
reinforcement learning agent
The agent’s objective is to maximise the
total reward that it receives over the long
run
The reward signal is used to alter the policy
Value Function (1)
The reward signal indicates what is good in
the short run while the value function
indicates what is good in the long run
The value of a state is the total amount of
reward an agent can expect to accumulate
over the future, starting in that state
Compute the value using the states that
are likely to follow the current state and the
rewards available in those states
Future rewards may be time-discounted with a factor γ in the interval [0, 1] (a short worked example follows below)
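A small worked example of a time-discounted return (the reward values and the factor γ = 0.9 are made up for illustration):

def discounted_return(rewards, gamma=0.9):
    """G = r0 + gamma*r1 + gamma^2*r2 + ... for one observed reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))   # 1 + 0.9*0 + 0.81*2 = 2.62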
Use the values to make and evaluate
decisions
Action choices are made based on value
judgements
Prefer actions that bring about states of
highest value instead of highest reward
Rewards are given directly by the
environment
Values must continually be re-estimated
from the sequence of observations that an
agent makes over its lifetime
What is Deep Reinforcement
Learning?
 Deep reinforcement learning is standard reinforcement learning in which a deep neural network is used to approximate either a policy or a value function (a minimal sketch follows below)
 Deep neural networks require lots of real or simulated interaction with the environment to learn
 Lots of trials/interactions are possible in simulated environments
 We can easily parallelise the trials/interactions in simulated environments
 We cannot easily do this with real robots outside simulation: action execution takes time, accidents and failures are expensive, and there are safety concerns
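A minimal sketch of a deep network approximating a policy, written with PyTorch; the layer sizes and the state/action dimensions are placeholders, not taken from the slides.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a state feature vector to a probability distribution over actions."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)   # action probabilities

policy = PolicyNet()
state = torch.zeros(1, 4)              # dummy state feature vector
probs = policy(state)
action = torch.multinomial(probs, 1)   # sample an action from the policy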
Summary
Get faster training because of parallelism
Can use on-policy reinforcement learning
methods
Diversity in exploration can lead to better
performance than synchronous methods
In practice, the on-policy A3C algorithm appears to be the best asynchronous reinforcement learning method in terms of final performance and training speed
