Lecture 9 - RL
6/13~6/17
https://github.com/microsoft/ChatGPT-Robot-Manipulation-Prompts
Yi-Ting Chen
Introduction to Artificial Intelligence https://lilianweng.github.io/posts/2018-02-19-rl-overview/
AlphaGo, AlphaZero
• DeepMind, 2016+
https://www.vis.xyz/pub/roach/ https://rl-at-scale.github.io/
https://www.cvlibs.net/publications/Renz2022CORL.pdf
https://github.com/autonomousvision/transfuser
https://arxiv.org/pdf/2108.08265.pdf
https://youtu.be/Va-F4qtTQ6g
https://ai.googleblog.com/2023/04/robotic-deep-rl-at-scale-sorting-waste.html https://rl-at-scale.github.io/
[Figure: 6D pose estimation pipeline — capture keypoints from an RGBD annotation, match corresponding points against an object template with known 6D pose, and thereby obtain the 6D pose of the unknown object.]
https://arxiv.org/pdf/2104.01542v2.pdf
• RL (Online)
• Unknown transition models
• Perform actions in the world to discover the dynamics and collect rewards
• This is the way humans work: we go through life taking various actions and
getting feedback; we get rewarded for doing well and learn along the
way
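The online interaction loop described above can be sketched as follows. This is an illustrative assumption of the agent/environment interfaces (`reset`, `step`, `act`, `learn` are hypothetical names), not a specific library API:

```python
def run_episode(env, agent, max_steps=100):
    """Run one episode: the agent acts, the (unknown) environment
    returns a reward and next state, and the agent learns from it."""
    s = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = agent.act(s)             # choose an action
        s2, r, done = env.step(a)    # environment reveals reward and next state
        agent.learn(s, a, r, s2)     # update from the feedback
        total_reward += r
        s = s2
        if done:
            break
    return total_reward
```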
Stanford CS221
[Roadmap diagram: Model-free RL — Hand-crafted state features? Yes → model-free RL with function approximation; No → deep reinforcement learning]
• Transition: T̂(s, a, s′) = #(s, a, s′) / #(s, a)
• Reward: R̂(s, a, s′) = average of the rewards r observed on transitions (s, a, r, s′)
Once we have estimated the transition and reward functions, iterative methods
(e.g., value iteration) can be applied to find the optimal policy
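Given estimated transitions and rewards, the iterative method can be sketched as value iteration on a toy MDP. The states, actions, and the `T`/`R` tables below are illustrative assumptions, not the lecture's example:

```python
states = [0, 1, 2]
actions = [0, 1]
gamma = 0.9

# T[s][a] -> list of (next_state, probability); state 2 is absorbing
T = {
    0: {0: [(0, 0.5), (1, 0.5)], 1: [(2, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(2, 1.0)]},
    2: {0: [(2, 1.0)], 1: [(2, 1.0)]},
}
# R[s][a][s'] -> reward: +10 for entering the absorbing state
R = {s: {a: {sp: (10.0 if sp == 2 and s != 2 else 0.0)
             for sp in states} for a in actions} for s in states}

# Value iteration: repeatedly apply the Bellman optimality backup
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: max(sum(p * (R[s][a][sp] + gamma * V[sp])
                    for sp, p in T[s][a])
                for a in actions)
         for s in states}

# Extract the greedy policy from the converged values
policy = {s: max(actions,
                 key=lambda a: sum(p * (R[s][a][sp] + gamma * V[sp])
                                   for sp, p in T[s][a]))
          for s in states}
```

In this toy MDP, action 1 reaches the rewarding absorbing state immediately, so the greedy policy chooses it from states 0 and 1.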
Problem
• If we never observe a pair (s, a), e.g., (A, ‘south’), we cannot
estimate its transition probabilities or the corresponding reward
• In reinforcement learning, we therefore need a policy that explores states
• This is the concept of exploration!
• We will discuss the idea later
• This distinguishes reinforcement learning from supervised learning
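The simplest way to keep visiting unseen (state, action) pairs is epsilon-greedy action selection: with probability epsilon take a random action, otherwise exploit the current estimates. A minimal sketch (the `Q` dict layout and parameter values are assumptions):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Q maps (state, action) -> estimated value; unseen pairs default to 0."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore: random action
    # exploit: best action under the current estimates
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```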
Yes No
Hand-crafted state
Model free RL
features?
Yes
No
Model free RL with
Deep Reinforcement
function
Learning
approximation
Iteration 4: Avg = (9 + 9 + 9 − 11)/4 = 4
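The running average above can also be computed incrementally, which is the form model-free Monte Carlo uses: each new observed utility u nudges the current estimate toward it with weight 1/n:

```python
def running_average(utilities):
    """Incremental mean: avg_n = avg_{n-1} + (u_n - avg_{n-1}) / n."""
    avg, n = 0.0, 0
    for u in utilities:
        n += 1
        avg += (u - avg) / n   # equivalent to the batch mean
    return avg

running_average([9, 9, 9, -11])  # -> 4.0, matching the average above
```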
Introduction to Artificial Intelligence
SARSA
• The update combines the observed data (the immediate reward) with the current
estimate of the Q-value (based on the data seen before)
• The difference between Q-learning and SARSA is that Q-learning involves a maximum over
actions in the next state, rather than just using the action the policy actually takes
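The two updates can be sketched side by side. `Q` is a dict of (state, action) → value, `alpha` is the learning rate, and `gamma` the discount; all names and defaults are illustrative assumptions:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=0.9):
    # On-policy: bootstrap with the action a2 the policy actually took in s2
    target = r + gamma * Q.get((s2, a2), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.5, gamma=0.9):
    # Off-policy: bootstrap with the best action in s2, regardless of the policy
    target = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```

The only difference between the two bodies is the `max` over next actions in the Q-learning target, which is exactly the distinction made above.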
Stanford CS221
Objective Function