Lecture 9 - RL

Any Questions?



Week  Date       Topic                               Note
1     2/13~2/17  Introduction
2     2/20~2/24  Machine Learning I
3     2/27~3/3   Machine Learning II                 No Class (2/28)
4     3/6~3/10   Machine Learning III                HW1 announce (3/6)
5     3/13~3/17  Problem Solving by Searching
6     3/20~3/24  Adversarial Search                  HW1 due and HW2 announce (3/21)
7     3/27~3/31  Markov Decision Process
8     4/3~4/7    Spring Break                        No Class (4/4); HW2 due and HW3 announce (4/7)
9     4/10~4/14  Reinforcement Learning
10    4/17~4/21  Constraint Satisfaction Problems    HW3 due and HW4 announce (4/21)
11    4/24~4/28  Bayesian Network
12    5/1~5/5    Knowledge, Reasoning, and Planning
13    5/8~5/12   3D Computer Vision                  HW4 due and HW5 announce (5/12)
14    5/15~5/19  Guest Talk (NVIDIA Research)
15    5/22~5/26  Guest Talk (TBA)                    HW5 due (5/26)
16    5/29~6/2   Guest Talk (NVIDIA Research)
17    6/5~6/9    Final Project Demo
18    6/13~6/17

*The above dates might change.


https://mingyuliu.net/



https://youtu.be/3GPNsPMqY8o



Intended Learning Outcomes
• List existing definitions of artificial intelligence
  • First lecture
• Identify challenges in the field of artificial intelligence
  • Covered throughout the semester
• Explain selected modeling approaches for identified problems
  • e.g., A* search (state-based model) for route planning (problem)
• Apply the learned models to solve problems
  • HW1 ~ HW5
• Develop your own solution for a problem
  • Final Project



Machine Learning
• Minimizing an objective on a given dataset (also called the training dataset)



State-based Models
• Search problems: UCS and A* search
• Adversarial search: Minimax, Alpha-Beta pruning, Expectimax
• MDPs: Policy evaluation, Value iteration



Example: Map Coloring

How to solve this problem?


Variable-based Models
• Variables
  • Solutions to problems ==> assignments to variables
• Examples of variable-based models
  • Constraint satisfaction problems
  • Bayesian networks
  • Hidden Markov models



The Robot System
A hierarchical design with the following five major levels:
1. The robot vehicle and its connection to user programs
   • Chapters 2 and 3, and Appendices A and G
2. Low-level actions (LLAs) (e.g., roll)
   • Chapter 4
3. Intermediate-level actions (ILAs) (e.g., push)
   • Chapters 5 and 6, and Appendix B
4. Planning (i.e., STRIPS)
   • Constructs sequences of ILAs needed to carry out specific tasks
   • Chapter 7
5. The executive, the program that actually invokes and monitors execution of the ILAs
   • Chapter 8

http://ai.stanford.edu/~nilsson/OnlinePubs-Nils/shakey-the-robot.pdf



A Task for Shakey
Fetch a box from an adjacent room and take it back to Room1.
How to achieve it?

Intermediate-level Actions
1. Robot GOTHRU D1 from R1 to R2
2. Robot PUSHTHRU Box1 through D1 from R2 to R1

How to obtain this conclusion?



ChatGPT for Robot Action Generation?

https://github.com/microsoft/ChatGPT-Robot-Manipulation-Prompts



Reinforcement Learning
Spring 2023

Yi-Ting Chen
https://lilianweng.github.io/posts/2018-02-19-rl-overview/
AlphaGo, AlphaZero
• DeepMind, 2016+



RL in Intelligent Systems

https://www.vis.xyz/pub/roach/
https://rl-at-scale.github.io/



End-to-end Driving Models
• Pros:
  • Jointly optimizes the targeted objective (i.e., a safe, comfortable, and efficient driving experience)
  • Cheap annotations
• Cons:
  • Limited generalization
  • Limited interpretability

https://www.cvlibs.net/publications/Renz2022CORL.pdf



End-to-end Driving Models

https://github.com/autonomousvision/transfuser



IL-based End-to-end Driving Models: Challenges
To train a smart end-to-end driving model with an Imitation Learning (IL) approach, we need lots of high-quality expert demonstrations.
● Human drivers as experts
  ○ High-quality data
  ○ Too expensive to collect lots of data 😵
● Automated experts (rule-based)
  ○ Efficiently generate large-scale demonstrations in driving simulators
  ○ Hand-crafted rule-based experts usually perform suboptimally 😵
We need another expert!



RL-based End-to-end Driving Expert
Why is it desirable to introduce Reinforcement Learning (RL) as a driving expert?
● Pros
  ○ An RL-based expert is automated, so it can be used to generate large-scale demonstrations efficiently in driving simulators
● Cons
  ○ RL-based end-to-end driving models are limited to driving simulators
  ○ Training RL-based models from scratch in the real world could lead to fatal car accidents



https://www.unrealengine.com/marketplace/en-US/product/japanese-street/reviews



Roach - the Reinforcement Learning Coach
STEP 1: Train RL expert
Input: bird’s-eye view semantic segmentation image
Output: low-level actions (steering, throttle and brake)

https://arxiv.org/pdf/2108.08265.pdf



Roach - the Reinforcement Learning Coach (cont.)
STEP 2: Train IL agent
Input: single camera image
Output: low-level actions

Imitation learning agents learn from supervision signals provided by RL experts

https://arxiv.org/pdf/2108.08265.pdf



Performance Comparison - IL Trained by Different Experts

https://arxiv.org/pdf/2108.08265.pdf



Demo: IL Agent Supervised by RL Expert



IL-based End-to-end Driving Models: Challenges
To train a smart end-to-end driving model with an Imitation Learning (IL) approach, we need lots of high-quality expert demonstrations.
● Human drivers as experts
  ○ High-quality data
  ○ Too expensive to collect lots of data 😵
● Automated experts (rule-based)
  ○ Efficiently generate large-scale demonstrations in driving simulators
  ○ Hand-crafted rule-based experts usually perform suboptimally 😵
We need another expert!



Any Questions?



Wayve

https://youtu.be/Va-F4qtTQ6g



Deep RL at Scale: Sorting Waste in Office Buildings with a Fleet of Mobile Manipulators

https://ai.googleblog.com/2023/04/robotic-deep-rl-at-scale-sorting-waste.html
https://rl-at-scale.github.io/



The Hanging Task

reaching → grasping → moving → hanging



6D Pose Estimation: Template Pose Matching

● How to obtain the 6D pose of an object?
  ○ Template pose matching

(Pipeline figure: capture RGBD and annotate keypoints for both the object template, whose 6D pose is known, and the object in an unknown pose; match the corresponding points to obtain the object's 6D pose.)



Grasping: other approaches
● How to obtain the 6D pose of an object?
○ Learning-based methods

https://arxiv.org/pdf/2104.01542v2.pdf



Deep RL at Scale
How can the various tools in the deep reinforcement learning toolbox be put together to address such a complex and diverse real-world task?
• The use of simulation to overcome the high sample complexity
  • Hard to characterize every scene in advance
• The use of prior data
  • Computer vision datasets or data from the Internet
• Large-scale collective learning involving fleets of multiple robots
  • Training from scratch in the real world presents a major exploration challenge

https://rl-at-scale.github.io/
https://rl-at-scale.github.io/assets/rl_at_scale.pdf



Markov Decision Processes (MDPs)
• An MDP is defined by:
  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s, a, s')
    • Probability that action a taken in s leads to s', i.e., P(s' | s, a)
    • Also called the dynamics model
  • A start state
  • Maybe a terminal state
  • A reward function R(s, a, s')
    • Reward for the transition (s, a, s')

RL: Unknown Transition Functions or Rewards
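To make the definition concrete, here is a minimal sketch (not from the lecture) of an MDP represented as plain Python data; the two-state "stay/quit" numbers below are made up for illustration.

# A minimal sketch (not from the lecture) of an MDP as plain Python data.
# The two-state "stay/quit" numbers below are hypothetical.
from dataclasses import dataclass, field

@dataclass
class MDP:
    states: set
    actions: dict          # state -> available actions
    T: dict                # (s, a) -> {s': probability}
    R: dict                # (s, a, s') -> reward
    start: str
    terminal: set = field(default_factory=set)

mdp = MDP(
    states={"in", "end"},
    actions={"in": ["stay", "quit"], "end": []},
    T={("in", "stay"): {"in": 2/3, "end": 1/3}, ("in", "quit"): {"end": 1.0}},
    R={("in", "stay", "in"): 4, ("in", "stay", "end"): 4, ("in", "quit", "end"): 10},
    start="in",
    terminal={"end"},
)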


A Computational Approach to Learning from Interactions
• Reinforcement learning problem:
  • Learning what to do – how to map situations to actions – so as to maximize a numerical reward signal
• Formalize the problem via dynamical systems theory:
  • Markov Decision Processes (MDPs)



MDPs vs RL
• MDPs (offline)
  • Known transition model
  • Find a policy to maximize expected utility

• RL (online)
  • Unknown transition model
  • Perform actions in the world to find it out and collect rewards
  • The way humans work: we go through life, taking various actions, getting feedback. We get rewarded for doing well and learn along the way



Reinforcement Learning Framework

Stanford CS221



MDP dynamics known?
• Yes → Iterative methods (e.g., value iteration and policy evaluation)
• No → Learn a model of the dynamics?
  • Yes → Model-based RL
  • No → State space of manageable size?
    • Yes → Model-free RL
    • No → Hand-crafted state features?
      • Yes → Model-free RL with function approximation
      • No → Deep Reinforcement Learning


Model-based RL
• An MDP is defined by:
  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s, a, s')
    • Probability that action a taken in s leads to s', i.e., P(s' | s, a)
    • Also called the dynamics model
  • A start state
  • Maybe a terminal state
  • A reward function R(s, a, s')
    • Reward for the transition (s, a, s')

Estimate the transition function and reward of the MDP from data!



Model-based Monte Carlo
• Monte Carlo is a standard way to estimate the expectation of a random variable by taking an average over samples of that random variable
• Data collected by following some policy: episodes of (s, a, r, s') transitions

• Transition: T̂(s, a, s') = (# times (s, a, s') is observed) / (# times action a is taken in state s)

• Reward: R̂(s, a, s') = average of the rewards r observed on transitions (s, a, s')



Example
Input policy π over states A, B, C, D, E; assume γ = 1

Observed Episodes (Training):
  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Learned Model:
  T(B, east, C) = 1.00
  T(C, east, D) = 0.75
  T(C, east, A) = 0.25
  ...
  R(B, east, C) = -1
  R(C, east, D) = -1
  R(D, exit, x) = +10
  ...

Berkeley CS188
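A minimal sketch of the count-based estimates above, using the four episodes from this example; the (state, action, reward, next_state) tuple encoding is an assumption, not the lecture's notation.

# A minimal sketch of model-based Monte Carlo: estimate T and R by counting.
from collections import defaultdict

episodes = [
    [("B", "east", -1, "C"), ("C", "east", -1, "D"), ("D", "exit", +10, "x")],
    [("B", "east", -1, "C"), ("C", "east", -1, "D"), ("D", "exit", +10, "x")],
    [("E", "north", -1, "C"), ("C", "east", -1, "D"), ("D", "exit", +10, "x")],
    [("E", "north", -1, "C"), ("C", "east", -1, "A"), ("A", "exit", -10, "x")],
]

counts = defaultdict(int)     # (s, a, s') -> count
totals = defaultdict(int)     # (s, a)     -> count
rewards = defaultdict(list)   # (s, a, s') -> observed rewards

for episode in episodes:
    for s, a, r, s_next in episode:
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
        rewards[(s, a, s_next)].append(r)

T_hat = {k: c / totals[(k[0], k[1])] for k, c in counts.items()}
R_hat = {k: sum(rs) / len(rs) for k, rs in rewards.items()}

print(T_hat[("C", "east", "D")])   # 0.75
print(T_hat[("C", "east", "A")])   # 0.25
print(R_hat[("D", "exit", "x")])   # 10.0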



Model-based Monte Carlo
• Monte Carlo is a standard way to estimate the expectation of a random variable by taking an average over samples of that random variable
• Data collected by following some policy: episodes of (s, a, r, s') transitions

• Transition: T̂(s, a, s') = (# times (s, a, s') is observed) / (# times action a is taken in state s)

• Reward: R̂(s, a, s') = average of the rewards r observed on transitions (s, a, s')

Once we know the transition function and reward, iterative methods (e.g., value iteration) can be applied to find the optimal policy.
Problem
• If we never observe a pair (s, a), e.g., (A, 'south'), we cannot estimate the transition function and the corresponding reward
• In reinforcement learning, we need a policy to explore states
  • This is the concept of Exploration!
  • We will discuss the idea later
• This distinguishes reinforcement learning from supervised learning



MDP dynamics known?
• Yes → Iterative methods (e.g., value iteration and policy evaluation)
• No → Learn a model of the dynamics?
  • Yes → Model-based RL
  • No → State space of manageable size?
    • Yes → Model-free RL
    • No → Hand-crafted state features?
      • Yes → Model-free RL with function approximation
      • No → Deep Reinforcement Learning


Model-free RL
• Estimate the optimal value directly
  • We do not explicitly estimate the transitions and rewards
  • Instead, we estimate Q(s, a) directly



Model-free Monte Carlo
• Data (following a policy π): episodes of states, actions, and rewards
• Qπ(s, a) is the expected utility of taking action a from state s, and then following policy π
• Utility (discounted sum of rewards from time t):
  U_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
• Estimate: Q̂π(s, a) = average of the utilities U_t observed after taking action a in state s



Example
Input policy π over states A, B, C, D, E; assume γ = 1

Observed Episodes (Training):
  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10

Output values: A = -10, B = +8, C = +4, D = +10, E = -2

Example for state C:
  Episode 1: (-1 + 10) = 9
  Episode 2: (-1 + 10) = 9
  Episode 3: (-1 + 10) = 9
  Episode 4: (-1 - 10) = -11
  Avg: (9 + 9 + 9 - 11) / 4 = 4
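A minimal sketch of the model-free Monte Carlo estimate, reproducing the value 4 computed for state C above; the episode encoding mirrors the earlier sketch and is an assumption.

# A minimal sketch of model-free Monte Carlo: estimate Q^pi(s, a) by averaging
# the observed utilities that follow each occurrence of (s, a).
from collections import defaultdict

gamma = 1.0
episodes = [  # (state, action, reward, next_state), the same four episodes as above
    [("B", "east", -1, "C"), ("C", "east", -1, "D"), ("D", "exit", +10, "x")],
    [("B", "east", -1, "C"), ("C", "east", -1, "D"), ("D", "exit", +10, "x")],
    [("E", "north", -1, "C"), ("C", "east", -1, "D"), ("D", "exit", +10, "x")],
    [("E", "north", -1, "C"), ("C", "east", -1, "A"), ("A", "exit", -10, "x")],
]

utilities = defaultdict(list)   # (s, a) -> list of observed utilities
for episode in episodes:
    for t, (s, a, _, _) in enumerate(episode):
        # utility = discounted sum of rewards from time t onward
        u = sum(gamma ** k * r for k, (_, _, r, _) in enumerate(episode[t:]))
        utilities[(s, a)].append(u)

Q_hat = {sa: sum(us) / len(us) for sa, us in utilities.items()}
print(Q_hat[("C", "east")])     # (9 + 9 + 9 - 11) / 4 = 4.0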



On-policy vs Off-policy
• Model-based Monte Carlo is off-policy
  • The model does not depend on the exact policy, as long as it was able to explore all (s, a) pairs

• Model-free Monte Carlo is on-policy: it depends strongly on the policy that is followed
  • The value being computed is dependent on the policy used to generate the data



Problem
• We use U_t as an estimate of Qπ(s, a)
• U_t only corresponds to a single episode

• Could we do better in estimating Qπ(s, a)?



A Convex Formulation of the Q-value Estimate
• Original formulation: Q̂π(s, a) = average of the utilities u observed after taking action a in state s
• Iterative procedure for building the mean as new numbers come in:
  Q̂π(s, a) ← (1 − η) Q̂π(s, a) + η u,   with η = 1/n for the n-th sample of (s, a)



Example (state C, the four episodes above, γ = 1)
Utilities observed from C:
  E1: (-1 + 10) = 9
  E2: (-1 + 10) = 9
  E3: (-1 + 10) = 9
  E4: (-1 - 10) = -11

Iteration 1: η = 1   → Q̂ = 9
Iteration 2: η = 1/2 → Q̂ = (1/2)(9) + (1/2)(9) = 9
Iteration 3: η = 1/3 → Q̂ = (2/3)(9) + (1/3)(9) = 9
Iteration 4: η = 1/4 → Q̂ = (3/4)(9) + (1/4)(-11) = 4

Same as the batch average: (9 + 9 + 9 - 11) / 4 = 4
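A minimal sketch of this iterative (convex-combination) update for a single (s, a) pair, reproducing the iterations above; starting from 0 and using η = 1/n for the n-th sample recovers the batch average.

# A minimal sketch of the running-mean (convex) update for a single (s, a) pair.
utilities = [9, 9, 9, -11]      # utilities observed from state C in the example

q_hat, n = 0.0, 0
for u in utilities:
    n += 1
    eta = 1.0 / n
    q_hat = (1 - eta) * q_hat + eta * u
    print(q_hat)                # 9.0, 9.0, 9.0, 4.0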
SARSA
• On each transition (s, a, r, s', a'), combine the observed data (the reward r) with the current estimate of Qπ(s', a') (based on the data seen before):
  Q̂π(s, a) ← (1 − η) Q̂π(s, a) + η [ r + γ Q̂π(s', a') ]
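A minimal sketch of the SARSA update applied to one (s, a, r, s', a') experience; Q is assumed to be a dict from (state, action) to the current estimate, and eta and gamma are illustrative values.

# A minimal sketch of the SARSA update on a single (s, a, r, s', a') experience.
def sarsa_update(Q, s, a, r, s_next, a_next, eta=0.1, gamma=1.0):
    # the target uses the action the policy actually took in s'
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target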



Problem
• Model-free Monte Carlo and SARSA only give us Q̂π, the value of the policy that generated the data
• How can we obtain Q_opt in order to act optimally?



Optimal Q-value
• Q_opt(s, a): the optimal Q-value if we take action a in state s (and act optimally afterwards)
• Q_opt(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_opt(s', a') ]

How to obtain the optimal Q-value in a model-free manner?



Q-Learning
Optimal Q-value update on each observed (s, a, r, s'):
  Q̂_opt(s, a) ← (1 − η) Q̂_opt(s, a) + η [ r + γ max_{a'} Q̂_opt(s', a') ]

• The difference between Q-learning and SARSA is that Q-learning involves a maximum over actions rather than just taking the action of the policy

Stanford CS221
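For contrast with SARSA, here is a minimal sketch of the Q-learning update; the only difference is the max over next actions in the target. The actions argument (the actions available in s') is an assumption of this sketch.

# A minimal sketch of the Q-learning update: the target uses a max over next
# actions instead of the action the policy actually took.
def q_learning_update(Q, actions, s, a, r, s_next, eta=0.1, gamma=1.0):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # max over actions
    target = r + gamma * best_next
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target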



Q-learning

Stanford CS221



Markov Decision Processes (MDPs)
• An MDP is defined by:
  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s, a, s')
    • Probability that action a taken in s leads to s', i.e., P(s' | s, a)
    • Also called the dynamics model
  • A start state
  • Maybe a terminal state
  • A reward function R(s, a, s')
    • Reward for the transition (s, a, s')

RL: Unknown Transition Functions or Rewards


Self-driving Cab
• How can we use RL techniques to pick up a passenger at one location and drop them off at another?
  • Drop off the passenger at the right location
  • Save the passenger's time by taking the minimum time possible to drop off
  • Take care of the passenger's safety and traffic rules



State Space
• Current location of the cab: 5 x 5 grid (25)
• Passenger location: R, G, B, Y, or inside the cab (5)
• Destination: R, G, B, Y (4)

• In total, we have 25 x 5 x 4 = 500 states



Action Space
• South, north, east, west, pickup, and dropoff

• Note that there are certain actions that the cab cannot perform due to obstacles (e.g., walls)



Rewards
• The agent should receive a high positive reward for a successful dropoff because this behavior is highly desired
• The agent should be penalized if it tries to drop off a passenger at a wrong location
• The agent should get a slight negative reward for every time-step it has not yet reached the destination. "Slight" negative because we would prefer our agent to arrive late rather than make wrong moves trying to reach the destination as fast as possible



Q-learning
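As a concrete end-to-end sketch, the loop below runs tabular Q-learning on the taxi task, assuming OpenAI Gym's Taxi-v3 environment (500 states, 6 actions, matching the spaces above). Hyperparameters and the episode count are illustrative, and the reset/step calls follow the classic gym API (gymnasium returns slightly different tuples).

# A minimal sketch of tabular Q-learning on a Taxi-like task (assumes gym's Taxi-v3).
import random
import numpy as np
import gym

env = gym.make("Taxi-v3")
Q = np.zeros((env.observation_space.n, env.action_space.n))   # 500 x 6 table
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(10000):
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, _ = env.step(action)
        # Q-learning update
        best_next = np.max(Q[next_state])
        Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (reward + gamma * best_next)
        state = next_state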



Any Questions?



Is Q-learning on-policy or off-policy?

Model-free Q-learning is off-policy, since it can learn the optimal policy using data collected from other policies.

Stanford CS221
Different RL Algorithms



How to determine the exploration policy?

We need to determine an exploration policy that specifies how to act in order to collect data for learning



Exploration Policy
• No exploration, all exploitation
  • Set π_act(s) = argmax_a Q̂_opt(s, a)
  • Once the agent finds a policy it believes is optimal, it will not try other actions
• No exploitation, all exploration
  • Set π_act(s) = a random action from Actions(s)
  • It doesn't exploit what it learns and ends up with a very low utility



Epsilon-greedy Algorithm
• The natural thing to do when you have two extremes is to interpolate between them
  • With probability ε, explore: choose a random action from Actions(s)
  • With probability 1 − ε, exploit: choose argmax_a Q̂_opt(s, a)
• Start with a high ε and gradually decrease it over time
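A minimal sketch of an ε-greedy policy with a decaying ε; Q maps (state, action) to the current estimate, and the decay schedule is an illustrative assumption.

# A minimal sketch of an epsilon-greedy exploration policy with a decaying epsilon.
import random

def epsilon_greedy(Q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit

epsilon, epsilon_min, decay = 1.0, 0.05, 0.999
for episode in range(10000):
    # ... run one episode, choosing actions with epsilon_greedy(...) ...
    epsilon = max(epsilon_min, epsilon * decay)   # start high, decrease over time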



MDP dynamics known?
• Yes → Iterative methods (e.g., value iteration and policy evaluation)
• No → Learn a model of the dynamics?
  • Yes → Model-based RL
  • No → State space of manageable size?
    • Yes → Model-free RL
    • No → Hand-crafted state features?
      • Yes → Model-free RL with function approximation
      • No → Deep Reinforcement Learning


In Q-learning
• Basic Q-learning keeps a table of all Q-values
• In realistic situations, we cannot possibly learn about every single state!
  • Too many states to visit them all in training
  • Too many states to hold the Q-table in memory



Function Approximation
(Similar to the evaluation function discussed in Adversarial Search)

• Function approximation fixes the issue of large lookup tables by parameterizing Q̂(s, a) with a weight vector w and a feature vector φ(s, a), e.g., Q̂(s, a; w) = w · φ(s, a)

Do you remember where we learned this?



Q-learning with Function Approximation
• We hope Q̂_opt(s, a; w) ≈ Q_opt(s, a) by using some parametric function, e.g., Q̂_opt(s, a; w) = w · φ(s, a)
• Loss function (objective): MSE between the prediction Q̂_opt(s, a; w) and the target r + γ max_{a'} Q̂_opt(s', a'; w)
• Gradient descent (partial derivative with respect to w) to minimize this objective
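A minimal sketch of this update with a linear approximator Q̂(s, a; w) = w · φ(s, a); the feature function phi is assumed to be given, and the target is treated as a constant when taking the gradient.

# A minimal sketch of Q-learning with linear function approximation.
import numpy as np

def q_hat(w, phi, s, a):
    return float(np.dot(w, phi(s, a)))

def fa_q_learning_update(w, phi, actions, s, a, r, s_next, eta=0.01, gamma=1.0):
    target = r + gamma * max(q_hat(w, phi, s_next, a2) for a2 in actions)
    prediction = q_hat(w, phi, s, a)
    # gradient of 0.5 * (prediction - target)^2 w.r.t. w is (prediction - target) * phi(s, a)
    w -= eta * (prediction - target) * phi(s, a)
    return w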



Gradient Descent
• Objective function:
  TrainLoss(θ) = (1 / |D_train|) Σ_{(x, y) ∈ D_train} (θ · φ(x) − y)²
• Partial derivative with respect to θ:
  ∇_θ TrainLoss(θ) = (1 / |D_train|) Σ_{(x, y) ∈ D_train} 2 (θ · φ(x) − y) φ(x)
• Update:
  θ ← θ − η ∇_θ TrainLoss(θ)
Stanford CS231n Note
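A small numeric sketch of the update above on a toy one-feature dataset (the data and step size are made up); θ converges toward the least-squares solution.

# A small numeric sketch of gradient descent on the training loss above.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # (x, y) pairs, with phi(x) = x
theta, eta = 0.0, 0.05

for step in range(100):
    grad = sum(2 * (theta * x - y) * x for x, y in train) / len(train)
    theta -= eta * grad                         # theta converges toward 2.0

print(theta)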



MDP dynamics known?
• Yes → Iterative methods (e.g., value iteration and policy evaluation)
• No → Learn a model of the dynamics?
  • Yes → Model-based RL
  • No → State space of manageable size?
    • Yes → Model-free RL
    • No → Hand-crafted state features?
      • Yes → Model-free RL with function approximation
      • No → Deep Reinforcement Learning


Deep Reinforcement Learning
• Instead of manually determining the states, let's take raw data (i.e., images) as states

• Parameterize Q̂(s, a) with neural networks that take images as input

• Deep Q-Networks (DeepMind, 2015)

• The field of Deep RL is growing very fast!

• http://rail.eecs.berkeley.edu/deeprlcourse/



Breakout Game
• Actions:
  • move_left
  • move_right
  • do_not_move
• Rewards
  • If the ball hits a brick, reward = 1
  • Otherwise, reward = 0
• End
  • If the ball falls off the screen, the game ends
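A minimal PyTorch sketch of a DQN-style Q-network for a Breakout-like game with the three actions above: stacked grayscale frames in, one Q-value per action out. The layer sizes follow the common DQN recipe but are assumptions here, not the lecture's exact model.

# A minimal PyTorch sketch of a DQN-style Q-network: raw (stacked) image frames in,
# one Q-value per action out. Layer sizes are illustrative.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, num_actions))

    def forward(self, x):               # x: (batch, 4, 84, 84) stacked grayscale frames
        return self.head(self.features(x / 255.0))

q_net = QNetwork(num_actions=3)         # Breakout-like: move_left, move_right, do_not_move
q_values = q_net(torch.zeros(1, 4, 84, 84))
action = int(q_values.argmax(dim=1))    # greedy action for this (dummy) observation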



Three Different Learning Paradigms
• Supervised Learning
  • The model learns from training data that consists of labeled pairs of inputs and outputs
  • For example, in chess, you would need all the moves a player can make in each state, along with labels indicating whether each move is good or not
• Unsupervised Learning
  • The model learns the hidden structure in the input data
• Reinforcement Learning
  • The model learns by maximizing the cumulative reward
  • The data is the environment that you are interacting with

