Reinforcement Learning
Reinforcement Learning in Humans
• Humans appear to learn to walk through “very few examples” of
trial and error. How they do so is an open question…
• Possible answers:
• Hardware: 230 million years of bipedal movement data.
• Imitation Learning: Observation of other humans walking.
• Algorithms: Better than backpropagation and stochastic gradient descent
[Figure: the perception-to-action stack: Environment → Sensors → Sensor Data → Feature Extraction → Representation → Machine Learning → Knowledge → Reasoning → Planning → Action → Effector → back to the Environment.]
Open Question: What can be learned from data?
[Figure: the same stack, with example sensors at the Sensors layer: GPS, Camera (Visible, Infrared), Radar, Lidar, Stereo Camera, Microphone, IMU, Networking (Wired, Wireless).]
[Figure: the same stack, illustrated with recognition tasks:]
• Image Recognition: If it looks like a duck
• Audio Recognition: Quacks like a duck
• Activity Recognition: Swims like a duck
Final breakthrough, 358 years after its conjecture:
“It was so indescribably beautiful; it was so simple and so elegant. I couldn’t understand how I’d missed it and I just stared at it in disbelief for twenty minutes. Then during the day I walked around the department, and I’d keep coming back to my desk looking to see if it was still there. It was still there. I couldn’t contain myself, I was so excited. It was the most important moment of my working life. Nothing I ever do again will mean as much.”
(Andrew Wiles, on his proof of Fermat’s Last Theorem)
[Figure: the same stack, annotated with two brackets: “The promise of Deep Learning” spans the feature extraction, representation, and machine learning layers; “The promise of Deep Reinforcement Learning” spans further, up through reasoning, planning, and action.]
Terminologies
• Agent
• State
• Action
• Policy
• Reward
• State Transition
Reinforcement Learning Framework
At each step, the agent:
• Executes an action
• Observes the new state
• Receives a reward
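A minimal sketch of this interaction loop in Python; the toy Environment class and the random action choice are hypothetical stand-ins for illustration, not from the slides:

```python
import random

class Environment:
    """A toy environment: walk right from state 0; the episode ends at state 5."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Action 1 moves right (+1 reward); action 0 stays put (0 reward).
        self.state += action
        reward = float(action)
        done = self.state >= 5
        return self.state, reward, done

env = Environment()
done = False
while not done:
    action = random.choice([0, 1])           # the agent executes an action
    state, reward, done = env.step(action)   # observes the new state, receives a reward
    print(f"state={state}, reward={reward}")
```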
Environment and Actions
• The policy function π maps a state-action pair (s, a) to a probability score
between 0 and 1.
• π(a | s) is a conditional probability distribution over actions, given the current state.
• In practice we do not have the state transition function, because of the randomness in
the environment (like the Goomba in our example). Only the environment has this function.
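As a sketch, here is one common way to represent such a policy in Python, assuming a small discrete state and action space; the softmax over a table of preferences is an illustrative parameterization, not something specified in the slides:

```python
import numpy as np

n_states, n_actions = 4, 3
rng = np.random.default_rng(0)
# Learnable preferences; a softmax over each row turns them into probabilities.
theta = rng.normal(size=(n_states, n_actions))

def policy(state):
    """Return pi(a | s): one probability per action, summing to 1."""
    logits = theta[state]
    exp = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return exp / exp.sum()

pi = policy(2)
print(pi, pi.sum())                   # each entry is in [0, 1]; they sum to 1
action = rng.choice(n_actions, p=pi)  # the agent samples its action from pi
```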
Rewards and Returns
• Return (aka cumulative future reward)
• U_t is the return at time t
• U_t is the sum of all the future rewards, from time t until the end of the game:
U_t = R_t + R_{t+1} + … + R_n
• The discounted return is more popular than the plain return defined above
Discounted Returns
• If future rewards are as important as the current reward, set gamma (γ, the
discount factor) to 1.
• If future rewards are less important, set gamma to a lower value in [0, 1).
• Note in the equation that the current reward is not discounted, but
the future rewards are.
Discounted Returns
• The discounted return is the weighted sum of the rewards from time t until
the end of the game.
• Suppose the game stops at time n; then the discounted return is
U_t = R_t + γ R_{t+1} + γ² R_{t+2} + … + γ^(n-t) R_n
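A short Python sketch of this computation on observed rewards; the reward list is made-up example data:

```python
def discounted_return(rewards, gamma):
    """u_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... over observed rewards."""
    u = 0.0
    for k, r in enumerate(rewards):   # k = 0 corresponds to time t
        u += (gamma ** k) * r
    return u

rewards = [1.0, 0.0, 2.0, 3.0]       # hypothetical observed rewards r_t ... r_n
print(discounted_return(rewards, gamma=1.0))   # 6.0: the undiscounted return
print(discounted_return(rewards, gamma=0.9))   # 4.807: future rewards count less
```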
Randomness in Returns
• At time t, we have not yet observed the rewards R_t, …, R_n
• R_t, …, R_n are unknown random variables, and are therefore denoted by upper-case letters
• U_t is a sum of R_t, …, R_n, and hence is itself an unknown random variable
Randomness in Returns - Observed u_t
• Suppose the game has ended.
• At this point, we have observed all the rewards.
• The observed rewards are therefore denoted by lower-case letters: r_t, …, r_n.
• The sum of all the observed rewards gives the return u_t, which is an observed value.
• u_t is just a number; it has no randomness.
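A small Python sketch of this distinction, using a made-up game whose rewards are coin flips; before playing, U_t is random, and each playthrough produces one observed number u_t:

```python
import random

def play_episode(n_steps=5, gamma=0.9):
    """Play one game with random +/-1 rewards and return the observed u_t."""
    u, discount = 0.0, 1.0
    for _ in range(n_steps):
        r = random.choice([-1.0, 1.0])  # each reward is random until observed
        u += discount * r
        discount *= gamma
    return u

# Before playing, U_t is a random variable; each episode yields one sample of it.
samples = [play_episode() for _ in range(5)]
print(samples)   # five different observed returns u_t, each just a number
```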
Examples of Reinforcement Learning
Cart-Pole Balancing
• Goal: Balance the pole on top of a moving cart
• State: Pole angle and angular speed; cart position and horizontal velocity
• Actions: Horizontal force applied to the cart
• Reward: +1 at each time step if the pole is upright
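This environment is available off the shelf; below is a minimal sketch using the gymnasium package (an assumption of this example; the slides do not name a library), with a random policy standing in for a learned one:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)   # obs = [cart pos, cart vel, pole angle, pole angular vel]
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()    # random action: push the cart left or right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                # +1 for every step the pole stays up
    done = terminated or truncated
env.close()
print(f"Episode return: {total_reward}")  # a learned policy would score far higher
```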
Examples of Reinforcement Learning
Doom
• Goal: Eliminate all opponents
• State: Raw pixels of the game screen
• Actions: Up, Down, Left, Right, Shoot, etc.
• Reward: Positive when eliminating an opponent, negative when the agent is eliminated
Human Life
• Goal: Survival? Happiness?
• State: Sight. Hearing. Taste. Smell. Touch.
• Actions: Think. Move.
• Reward: Homeostasis?
3 Types of Reinforcement Learning
Example: one value per state-action pair (rows are states, columns are actions)

      A1   A2   A3   A4
S1    +1   +2   -1    0
S2    +2    0   +1   -2
S3    -1   +1    0   -2
S4    -2    0   +1   +1
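As a sketch of how an agent could use such a table, the Python below acts greedily with respect to it; the numpy array reproduces the values above, and the greedy rule is one illustrative choice (a real agent would also need exploration):

```python
import numpy as np

# Rows are states S1..S4, columns are actions A1..A4 (the table above).
values = np.array([
    [+1, +2, -1,  0],
    [+2,  0, +1, -2],
    [-1, +1,  0, -2],
    [-2,  0, +1, +1],
])

def greedy_action(state_index):
    """Pick the action with the highest value in the given state's row."""
    return int(np.argmax(values[state_index]))

for s in range(4):
    print(f"S{s + 1}: best action A{greedy_action(s) + 1}")
# S1 -> A2, S2 -> A1, S3 -> A2, S4 -> A3 (A3/A4 tie; argmax takes the first)
```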