
Introduction to

Reinforcement Learning
DAVIDE BACCIU – [email protected]
Preliminaries

DAVIDE BACCIU - UNIVERSITÀ DI PISA 2


The Course (3 CFU – ~24h)
✓RL Fundamentals
✓Markov Decision Processes
✓Planning by Dynamic Programming
✓Model-Free Prediction & Control
✓Value Function Methods
✓Policy Gradient Methods
✓Exploration and Exploitation
✓Deep reinforcement learning
✓Advanced topics and applications: continual learning, meta-learning, inverse RL, RL frameworks, some case studies

Delivered through lectures by me, guest lectures by Maurizio Parton, and student seminars.

DAVIDE BACCIU - UNIVERSITÀ DI PISA 3


Course Completion
✓ M.Sc. students
✓Prepare a 15-minute presentation to be given in one of the 2 student lecture dates
✓ 5-minute Q&A on the content of the presentation
✓ Seminar content
✓ Read 3 relevant papers on a topic of interest for the course; summarize their content and compare the methods
✓ Implement 1 RL method from the literature and attempt a validation on a simple application; describe the model, its implementation and the validation results

✓ Ph.D. Students
✓ Read 3 (or more) relevant papers on a topic of interest for the course and summarize their content in a report (6-10 pages single column, NeurIPS format)
✓ Sketch/propose a novel RL method/application: report your idea in sufficient detail in a short paper (6-10 pages single column, NeurIPS format)
✓ Implement an RL-based application and validate it: prepare a short presentation to report the results (10-15 slides describing the model, the implementation and the results)
✓ Contact me and agree on alternative ways (e.g. using RL in your Ph.D. project, …)

DAVIDE BACCIU - UNIVERSITÀ DI PISA 4


Fill in your contact details here
https://docs.google.com/spreadsheets/d/10fcU3SVYb5wOYd-c6uag9kk5spq6O1Q86m0fv2aSWsA/edit?usp=sharing

DAVIDE BACCIU - UNIVERSITÀ DI PISA 5


Resources
The course webpage (elearning.di.unipi.it/course/view.php?id=227)
✓Course calendar & news
✓Slides, video lectures, additional materials
✓Course assignment upload

Reference Book

Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition, MIT Press (available online)

DAVIDE BACCIU - UNIVERSITÀ DI PISA 6


A Note
Much of the content of this course and its slides is heavily based on the classical course by David Silver
https://www.davidsilver.uk/teaching/

DAVIDE BACCIU - UNIVERSITÀ DI PISA 7


Introduction

DAVIDE BACCIU - UNIVERSITÀ DI PISA 8


Positioning Reinforcement Learning

DAVIDE BACCIU - UNIVERSITÀ DI PISA 9


What characterizes Reinforcement
Learning (vs other ML tasks)?
✓No supervisor: only a reward signal
✓Delayed asynchronous feedback
✓Time matters (sequential data, continual learning)
✓Agent’s actions affect the subsequent data it receives (inherent
non-stationarity)

DAVIDE BACCIU - UNIVERSITÀ DI PISA 10


Using Reinforcement Learning
✓Learning to maneuver vehicles
✓Learn to control robots (walking, navigation, manipulation)
✓Manage portfolios
✓Play games
✓Discover new molecules
✓End-to-end learning with discrete structures

DAVIDE BACCIU - UNIVERSITÀ DI PISA 11


Game Playing

DAVIDE BACCIU - UNIVERSITÀ DI PISA 12


Navigation

DAVIDE BACCIU - UNIVERSITÀ DI PISA 13


Manipulation

https://www.youtube.com/watch?v=jwSbzNHGflM

DAVIDE BACCIU - UNIVERSITÀ DI PISA 14


Formalizing
Reinforcement Learning

DAVIDE BACCIU - UNIVERSITÀ DI PISA 15


Rewards
✓A reward 𝑅𝑡 is a scalar feedback signal
✓Indicates how well the agent is doing at step 𝑡
✓The agent’s job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
✓All goals can be described by the maximisation of expected cumulative reward

DAVIDE BACCIU - UNIVERSITÀ DI PISA 16


What is a Reward?
✓Learning to drive a car (+ve reward for getting places safely, −ve reward for crashing)
✓Make a humanoid robot walk (+ve reward for forward motion, −ve reward for tripping over)
✓Make a robot arm manipulate objects (+ve reward for goal achievement, −ve reward for object falling)
✓Manage an investment portfolio (+ve reward for each $ in bank)
✓Play games (+ve/−ve reward for increasing/decreasing score)
✓Discover new molecules (+ve reward for synthesizable molecule, −ve reward for toxic molecule)

DAVIDE BACCIU - UNIVERSITÀ DI PISA 17


Sequential Decision Making
✓Goal: select actions to maximise total future reward
✓Actions may have long term consequences
✓Reward may be delayed
✓It may be better to sacrifice immediate reward to gain more long-term reward
✓Examples:
✓A financial investment (may take months to mature)
✓Refuelling a helicopter (might prevent a crash in several hours)
✓Blocking opponent moves (might help winning chances many moves from now)

DAVIDE BACCIU - UNIVERSITÀ DI PISA 18


Agent and
Environment
✓At each step 𝑡 the agent:
✓ Executes action 𝐴𝑡
✓ Receives observation 𝑂𝑡
✓ Receives scalar reward 𝑅𝑡

✓The Environment:
✓ Receives action 𝐴𝑡
✓ Emits observation 𝑂𝑡+1
✓ Emits scalar reward 𝑅𝑡+1

✓𝑡 increments at each environment step
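
A minimal sketch of this agent-environment loop, using a toy coin-guessing environment and a random agent; the names (CoinFlipEnv, RandomAgent) and reward values are hypothetical illustrations, not course code:

```python
import random

class CoinFlipEnv:
    """Toy environment: the agent guesses a coin face; +1 reward when correct."""
    def reset(self):
        self.t = 0
        return 0                                   # initial observation O_0

    def step(self, action):
        self.t += 1                                # t increments at each environment step
        coin = random.choice([0, 1])
        reward = 1.0 if action == coin else 0.0    # scalar reward R_{t+1}
        observation = coin                         # observation O_{t+1}
        done = self.t >= 10
        return observation, reward, done

class RandomAgent:
    def act(self, observation):
        return random.choice([0, 1])               # action A_t

env, agent = CoinFlipEnv(), RandomAgent()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = agent.act(obs)                        # agent executes A_t
    obs, reward, done = env.step(action)           # environment emits O_{t+1}, R_{t+1}
    total_reward += reward
print("cumulative reward:", total_reward)
```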

DAVIDE BACCIU - UNIVERSITÀ DI PISA 19


History and State
The history is the sequence of observations, actions, rewards
𝐻𝑡 = 𝑂1, 𝑅1, 𝐴1, … , 𝐴𝑡−1, 𝑂𝑡, 𝑅𝑡
✓i.e. all observable variables up to time 𝑡
✓i.e. the sensorimotor stream of a robot or embodied agent

What happens next depends on the history:
✓The agent selects actions
✓The environment selects observations/rewards

State 𝑆𝑡 is the information used to determine what happens next and is a function of the history
𝑆𝑡 = 𝑓(𝐻𝑡)
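
A small sketch of the state as a function of the history, 𝑆𝑡 = 𝑓(𝐻𝑡); here 𝑓 is assumed, purely for illustration, to keep only the most recent observation:

```python
# History H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t as a list of (obs, reward, action) steps
history = [("o1", 0.0, "a1"), ("o2", -1.0, "a2"), ("o3", -1.0, None)]

def state_from_history(history):
    """S_t = f(H_t): here f (arbitrarily) keeps only the latest observation."""
    latest_obs, _, _ = history[-1]
    return latest_obs

print(state_from_history(history))  # "o3"
```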

DAVIDE BACCIU - UNIVERSITÀ DI PISA 20


Environment State
✓The environment state 𝑆𝑡𝑒 is the environment’s private representation at time 𝑡
✓Whatever information the environment uses to generate the next observation/reward
✓The environment state is not usually visible to the agent (unobservable environment)
✓ Even if 𝑆𝑡𝑒 is visible, it may contain irrelevant information

DAVIDE BACCIU - UNIVERSITÀ DI PISA 21


Agent State
✓The agent state 𝑆𝑡𝑎 is the internal representation owned by the agent
✓Whatever information the agent uses to select the next action
✓The agent state is the information used by reinforcement learning algorithms
✓Generally speaking, a function of the history
𝑆𝑡𝑎 = 𝑓(𝐻𝑡)
DAVIDE BACCIU - UNIVERSITÀ DI PISA 22
Information (Markov) State
An information state (Markov state) contains all useful information from
the history
Definition (Markov State)
A state 𝑆𝑡 is Markov if and only if
𝑃(𝑆𝑡+1 | 𝑆1, … , 𝑆𝑡) = 𝑃(𝑆𝑡+1 | 𝑆𝑡)

✓The future is independent of the past given the present (d-separation)
𝐻1:𝑡 → 𝑆𝑡 → 𝐻𝑡+1:∞
✓The state is a sufficient statistic for the future
✓The environment state 𝑆𝑡𝑒 is Markov
✓The history 𝐻𝑡 is Markov
DAVIDE BACCIU - UNIVERSITÀ DI PISA 23
What’s the best (agent) state model?

DAVIDE BACCIU - UNIVERSITÀ DI PISA 24


Fully Observable Environment
✓Full observability ⟹ Agent directly observes the environment state
𝑂𝑡 = 𝑆𝑡𝑎 = 𝑆𝑡𝑒
✓Formally this is a Markov Decision Process (MDP)
✓Next lecture (and much of the RL literature)

DAVIDE BACCIU - UNIVERSITÀ DI PISA 25


Partially Observable Environment
✓Partial observability ⟹ Agent indirectly observes the environment
✓A robot with camera vision only may not know absolute location
✓A trading agent only observes current prices
✓A poker player only observes public cards
✓Formally 𝑆𝑡𝑎 ≠ 𝑆𝑡𝑒 and the problem is a Partially Observable Markov Decision Process (POMDP)
✓The agent needs to build its own state representation 𝑆𝑡𝑎
✓History: 𝑆𝑡𝑎 = 𝐻𝑡
✓Beliefs on the environment state: 𝑆𝑡𝑎 = [𝑃(𝑆𝑡𝑒 = 𝑠1), … , 𝑃(𝑆𝑡𝑒 = 𝑠𝑁)]
✓A dynamic memory (RNN): 𝑆𝑡𝑎 = 𝜎(𝑊𝑠 𝑆𝑡−1𝑎 + 𝑊𝑜 𝑂𝑡)
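
A minimal sketch of the recurrent agent-state update above, assuming small hypothetical dimensions and randomly initialised weight matrices 𝑊𝑠 and 𝑊𝑜:

```python
import numpy as np

# Hypothetical sizes: 8-dimensional agent state, 4-dimensional observations
state_dim, obs_dim = 8, 4
rng = np.random.default_rng(0)
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))   # recurrent weights (assumed random)
W_o = rng.normal(scale=0.1, size=(state_dim, obs_dim))     # observation weights (assumed random)

def update_agent_state(prev_state, observation):
    """S_t^a = sigma(W_s S_{t-1}^a + W_o O_t), with sigma = tanh here."""
    return np.tanh(W_s @ prev_state + W_o @ observation)

# Fold a short stream of (dummy) observations into the agent state
agent_state = np.zeros(state_dim)
for obs in rng.normal(size=(5, obs_dim)):
    agent_state = update_agent_state(agent_state, obs)
```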

DAVIDE BACCIU - UNIVERSITÀ DI PISA 26


Components of a
Reinforcement Learning
Agent

DAVIDE BACCIU - UNIVERSITÀ DI PISA 27


Key Components of an RL Agent
✓Policy: agent’s behaviour function
✓Value function: how good is each state and/or action
✓Model: agent’s representation of the environment

An RL agent may include one or more of the above

DAVIDE BACCIU - UNIVERSITÀ DI PISA 28


Policy
✓A policy 𝜋 is the agent’s behaviour
✓It is a map from state 𝑠 to action 𝑎
✓Deterministic policy: 𝑎 = 𝜋(𝑠)
✓Stochastic policy: 𝜋(𝑎|𝑠) = 𝑃(𝐴𝑡 = 𝑎 | 𝑆𝑡 = 𝑠)
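
A small sketch of the two kinds of policy over a toy discrete state/action space (state names and probabilities are made up for illustration):

```python
import random

# Deterministic policy: a lookup table, a = pi(s)
deterministic_policy = {"s0": "E", "s1": "N"}

# Stochastic policy: pi(a|s) = P(A_t = a | S_t = s), one distribution over actions per state
stochastic_policy = {
    "s0": {"N": 0.1, "E": 0.7, "S": 0.1, "W": 0.1},
    "s1": {"N": 0.4, "E": 0.2, "S": 0.2, "W": 0.2},
}

def sample_action(state):
    """Draw an action from pi(.|s)."""
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(deterministic_policy["s0"], sample_action("s0"))
```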

DAVIDE BACCIU - UNIVERSITÀ DI PISA 29


Value Function
How “good” is a specific
state/action for an agent?

DAVIDE BACCIU - UNIVERSITÀ DI PISA 30


Value Function
✓The value function 𝑣 is a predictor of future reward
✓Used to evaluate the goodness/badness of states
✓And therefore to select between actions, e.g.
𝑣𝜋(𝑠) = 𝔼𝜋[𝑅𝑡+1 + 𝛾𝑅𝑡+2 + 𝛾²𝑅𝑡+3 + ⋯ | 𝑆𝑡 = 𝑠]

Expected (discounted) future reward following policy 𝜋 from state 𝑠
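
A small sketch of the discounted return inside this expectation, computed for one sampled reward sequence with an assumed 𝛾 = 0.9; 𝑣𝜋(𝑠) would be the average of this quantity over many trajectories started in 𝑠:

```python
def discounted_return(rewards, gamma=0.9):
    """Return R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for one sampled trajectory."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards observed after leaving state s while following policy pi (made-up numbers)
print(discounted_return([0.0, 0.0, 1.0, 5.0], gamma=0.9))   # 0.81*1 + 0.729*5 = 4.455
```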

DAVIDE BACCIU - UNIVERSITÀ DI PISA 31


Model
✓A model predicts what the environment will do next
✓Predict the next state 𝑠′ following an action 𝑎
𝒫𝑠𝑠′𝑎 = 𝑃(𝑆𝑡+1 = 𝑠′ | 𝑆𝑡 = 𝑠, 𝐴𝑡 = 𝑎)
✓Predict the next reward
ℛ𝑠𝑎 = 𝔼[𝑅𝑡+1 | 𝑆𝑡 = 𝑠, 𝐴𝑡 = 𝑎]
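
A minimal tabular sketch of such a model, with hypothetical states, actions and values; lookups return the next-state distribution 𝒫𝑠𝑠′𝑎 and the expected reward ℛ𝑠𝑎:

```python
# Transition model P^a_ss': distribution over next states for each (state, action) pair
transition_model = {
    ("s0", "E"): {"s1": 0.9, "s0": 0.1},
    ("s1", "N"): {"s2": 1.0},
}

# Reward model R^a_s: expected immediate reward for each (state, action) pair
reward_model = {
    ("s0", "E"): -1.0,
    ("s1", "N"): -1.0,
}

def predict(state, action):
    """Return the predicted next-state distribution and the expected reward."""
    return transition_model[(state, action)], reward_model[(state, action)]

next_state_dist, expected_reward = predict("s0", "E")
```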

DAVIDE BACCIU - UNIVERSITÀ DI PISA 32


A Forever Classic - The Maze Example
✓Rewards: -1 per time-step
✓Actions: N, E, S, W
✓States: Agent location

DAVIDE BACCIU - UNIVERSITÀ DI PISA 33


Maze Example
(Policy)
Arrows represent the policy 𝜋(𝑠) for each state 𝑠

DAVIDE BACCIU - UNIVERSITÀ DI PISA 34


Maze Example
(Value Function)
Numbers denote the value 𝑣𝜋(𝑠) for each state 𝑠, i.e. the (negated) expected time to reach the goal

DAVIDE BACCIU - UNIVERSITÀ DI PISA 35


Maze Example
(Model)
✓Agent may have an internal (imperfect) model of the environment
✓How actions change the state
✓How much reward from each state
✓Grid Layout: transition model 𝒫𝑠𝑠′𝑎
✓Numbers: immediate reward model ℛ𝑠𝑎
DAVIDE BACCIU - UNIVERSITÀ DI PISA 36
Characterizing RL Agents (I)
✓Value Based
✓No Policy (Implicit)
✓Value Function
✓Policy Based
✓Policy
✓No Value Function
✓Actor Critic
✓Policy
✓Value Function

DAVIDE BACCIU - UNIVERSITÀ DI PISA 37


Characterizing RL Agents (II)
✓Model Free
✓Policy and/or Value Function
✓No Model
✓Model Based
✓Policy and/or Value Function
✓Model

DAVIDE BACCIU - UNIVERSITÀ DI PISA 38


A Taxonomy

DAVIDE BACCIU - UNIVERSITÀ DI PISA 39


Problems within
Reinforcement Learning

DAVIDE BACCIU - UNIVERSITÀ DI PISA 40


Learning and Planning
Two fundamental problems in sequential decision making
✓Reinforcement Learning
✓The environment is initially unknown
✓The agent interacts with the environment
✓The agent improves its policy
✓Planning (reasoning, introspection, search,…)
✓A model of the environment is known
✓The agent performs computations with its model (no external interaction)
✓The agent improves its policy

DAVIDE BACCIU - UNIVERSITÀ DI PISA 41


Atari Example: Reinforcement Learning

✓Rules of the game are unknown
✓Learn directly from interactive game-play
✓Pick actions on joystick, see pixels and scores

DAVIDE BACCIU - UNIVERSITÀ DI PISA 42


Atari Example: Planning


✓Rules of the game are known
✓Agent contains emulator (model)
✓If I take action 𝑎 from state 𝑠:
✓what would the next state be?
✓what would the score be?
✓Plan ahead to find optimal policy
✓e.g. tree search

DAVIDE BACCIU - UNIVERSITÀ DI PISA 43


Exploration Vs Exploitation
✓Reinforcement Learning follows a trial-and-error process
✓The agent should discover a good policy
✓From its experiences of the environment
✓Without losing too much reward along the way

✓Exploration finds more information about the environment


✓Exploitation exploits known information to maximise reward
Effective reinforcement learning requires trading off exploration and exploitation
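
A minimal sketch of one standard way to realise this trade-off, ε-greedy selection over estimated action values (the values below are made up):

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(list(action_values))            # explore
    return max(action_values, key=action_values.get)         # exploit

# Estimated values of three restaurants (made-up numbers)
q = {"favourite": 4.2, "new_place": 3.9, "other": 2.5}
print(epsilon_greedy(q, epsilon=0.1))
```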

DAVIDE BACCIU - UNIVERSITÀ DI PISA 44


Examples
✓Restaurant Selection
✓Exploitation - Go to your favourite restaurant
✓Exploration - Try a new restaurant

✓Holiday planning
✓Exploitation – The campsite you have been going to since you were born
✓Exploration – Hitchhike and follow the flow

✓Game Playing
✓Exploitation - Play the move you believe is best
✓Exploration - Play an experimental move

DAVIDE BACCIU - UNIVERSITÀ DI PISA 45


Prediction & Control
✓Prediction: evaluate the future
✓Given a policy
✓Control: optimise the future
✓Find the best policy

DAVIDE BACCIU - UNIVERSITÀ DI PISA 46


Gridworld Example - Prediction

What is the value function for the uniform random policy?

DAVIDE BACCIU - UNIVERSITÀ DI PISA 47


Gridworld Example - Control

What is the optimal value function 𝑣∗ over all possible policies?
What is the optimal policy 𝜋∗?

DAVIDE BACCIU - UNIVERSITÀ DI PISA 48


Wrap-up

DAVIDE BACCIU - UNIVERSITÀ DI PISA 49


Take home messages
✓Reinforcement learning is a general-purpose framework for decision-making
✓Reinforcement learning is for an agent with the capacity to act and observe
✓The state is the sufficient statistic to characterize the future
✓Depends on the history of actions and observations
✓Environment state vs agent state
✓Success is measured by a scalar reward signal
✓The goal is to select actions to maximise future reward (exploit)
✓In order to be effective we should not forget to explore

DAVIDE BACCIU - UNIVERSITÀ DI PISA 50


Next Lecture
Markov Decision Processes (MDP)
✓A formal model for (observable) environments in reinforcement
learning
✓Foundational approach to reinforcement learning
✓Markov property and processes
✓Discounted rewards
✓Bellman expectation
✓Extending to partially-observable environments

DAVIDE BACCIU - UNIVERSITÀ DI PISA 51


Short-term Calendar (tentative)
✓L2 – Thursday 01 Apr h. 14-16
✓L3 - Wednesday 7 Apr h. 14-16
✓L4 – Tuesday 27 Apr h. 14-16
✓L5 - Tuesday 04 May h. 14-16
✓L6 - Monday 10 May h. 16-18
✓L7-L9 – Guest lecture by Maurizio Parton

DAVIDE BACCIU - UNIVERSITÀ DI PISA 52
