
Reinforcement Learning

Machine Learning: Supervised vs Reinforcement
Supervised learning is “teach by example”: here are some examples, now learn the patterns in these examples.
Reinforcement learning is “teach by experience”: here is a world, now learn patterns by exploring it.

Reinforcement Learning in Humans
• Humans appear to learn to walk through “very few examples” of trial and error. How they do this is an open question…
• Possible answers:
• Hardware: 230 million years of bipedal movement data.
• Imitation Learning: observation of other humans walking.
• Algorithms: something better than backpropagation and stochastic gradient descent.
The AI stack: Environment → Sensors → Sensor Data → Feature Extraction → Representation → Machine Learning → Knowledge → Reasoning → Planning → Action → Effector.
Open question: what can be learned from data?
Example sensors feeding the stack: GPS, camera (visible, infrared), radar, lidar, stereo camera, microphone, IMU, and networking (wired, wireless).
The duck test across the stack. Image recognition: if it looks like a duck. Audio recognition: quacks like a duck. Activity recognition: swims like a duck.
Final breakthrough, 358 years after its conjecture:
“It was so indescribably beautiful; it was so simple and so elegant. I couldn’t understand how I’d missed it and I just stared at it in disbelief for twenty minutes. Then during the day I walked around the department, and I’d keep coming back to my desk looking to see if it was still there. It was still there. I couldn’t contain myself, I was so excited. It was the most important moment of my working life. Nothing I ever do again will mean as much.”
The promise of Deep Learning: learning the lower layers of the stack (sensor data, feature extraction, representation, machine learning).
The promise of Deep Reinforcement Learning: extending this up through knowledge, reasoning, planning, and action.
Terminologies
• Agent
• State
• Action
• Policy
• Reward
• State Transition
Reinforcement Learning Framework
At each step, the agent:
• Executes an action
• Observes the new state
• Receives a reward
(A minimal sketch of this loop follows below.)
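The following is a minimal Python sketch of this interaction loop. The names env and agent (and their reset, step, and act methods) are hypothetical stand-ins for any environment and policy implementation, not code from the slides.

def run_episode(env, agent, max_steps=1000):
    state = env.reset()                        # initial state from the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)              # execute an action
        state, reward, done = env.step(action) # observe new state, receive reward
        total_reward += reward
        if done:                               # terminal state reached
            break
    return total_reward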
Environment and Actions
• Fully Observable (Chess) vs Partially Observable (Poker)
• Single Agent (Atari) vs Multi-Agent (DeepTraffic)
• Deterministic (Cart Pole) vs Stochastic (DeepTraffic)
• Static (Chess) vs Dynamic (DeepTraffic)
• Discrete (Chess) vs Continuous (Cart Pole)
Major Components of an RL Agent
An RL agent may be directly or indirectly trying to learn a:
• Policy: agent’s behavior function
• Value function: how good is each state and/or action
• Model: agent’s representation of the environment

An episode is a trajectory of states, actions, and rewards:
s_0, a_0, r_1, s_1, a_1, r_2, …, s_{n−1}, a_{n−1}, r_n, s_n
where s_n is the terminal state.
Policy
• In everyday usage, “policy” refers to the rules, regulations, or services of an organization.
• In RL, a policy means making a decision based on the current state.
• While playing a game, we observe the state on the screen and make a decision accordingly.
• Here the policy is to avoid enemies and to collect coins.
Policy
• Formally, the policy function can be defined as
π(a | s) = P(A = a | S = s)
• The policy function π maps a state-action pair to a probability score between 0 and 1.
• π is a conditional probability density function.
• It is the probability of taking action a while observing state s (a toy sketch follows below).
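As a toy illustration of this definition (not from the slides), a policy over a small discrete problem can be written as a table of probabilities; the state names, action names, and numbers below are made up.

POLICY_TABLE = {
    "state_1": {"left": 0.2, "right": 0.1, "up": 0.7},
    "state_2": {"left": 0.5, "right": 0.4, "up": 0.1},
}

def pi(action, state):
    # probability of taking the given action while observing the given state;
    # the values for all actions in a given state sum to 1
    return POLICY_TABLE[state][action]

print(pi("up", "state_1"))   # 0.7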
Policy
• For example, while observing a specific game state, the agent can take one of three actions. If we feed this state to the policy function π, it will output three probability scores.
• The probabilities output by the policy function guide the agent’s decisions.
• With these probabilities at hand, let’s perform random sampling.
Policy
• Any of the three actions may be chosen, but moving up has the highest probability.
• Decisions are made by the policy function.
• How to learn the policy function is the main theme of RL.
• In the current example, the agent’s actions are random: they are sampled according to the probabilities output by the policy function (see the sampling sketch below).
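The random sampling step can be sketched as follows; the action names and probability values are made up for illustration, and numpy is assumed to be available.

import numpy as np

actions = ["left", "right", "up"]
probs = [0.2, 0.1, 0.7]                      # hypothetical output of pi(a | s)

action = np.random.choice(actions, p=probs)  # random sampling
print(action)                                # "up" is most likely, but any action can occur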
Reward
• The goal of reinforcement learning is to maximize the cumulative reward.
• The choice of reward affects the policy.
• For example, if the reward for winning the game is not higher than the reward for collecting a coin, then Mario will prefer to collect coins instead of winning the game.
State Transition
• State transitions can be deterministic or random.
• Randomness in the transition comes from the environment.
• In our case, the environment is the game: the game’s program determines the next state.
• The movement of the enemy (the Goomba) is random and is determined by the environment; it is not in our control.
• Because of the Goomba’s randomness, the next state is random.
• We can denote the state transition by a function p(s′ | s, a), where s′ is the next state.
• In practice we don’t have the state transition function because of the randomness (like the Goomba in our case); only the environment has this function (a toy illustration follows below).
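A toy sketch (not the game’s actual program) of a stochastic transition: given the same state and action, the next state can differ because of randomness inside the environment. The state fields and movement rules are made up.

import random

def step(state, action):
    goomba_move = random.choice([-1, 0, +1])            # randomness from the environment
    return {
        "mario_x": state["mario_x"] + (1 if action == "right" else 0),
        "goomba_x": state["goomba_x"] + goomba_move,
    }

s = {"mario_x": 0, "goomba_x": 5}
print(step(s, "right"))   # calling this twice can give different next states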
Rewards and Returns
• Return (aka cumulative future reward):
U_t = R_t + R_{t+1} + R_{t+2} + … + R_n
• U_t is the return at time t.
• U_t is the sum of all the future rewards (from time t until the end of the game).
• The discounted return is more popular than the return defined above.
Discounted Returns
• Given the choice between receiving a reward now and being promised the same reward in the future, taking the immediate reward is the obvious choice, because the future is full of uncertainty.
• This implies that future rewards should be given lower weights.
Discounted Returns
• As future rewards are less important, they should be discounted:
U_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + …
• If a future reward is as important as the current reward, set the discount factor γ to 1.
• If future rewards are less important, set γ to a lower number.
• Note that in the equation the current reward is not discounted, but the future rewards are.
Discounted Returns
• The discounted return is the weighted sum of the rewards from time t until the end of the game (a small computation sketch follows below).
• Suppose the game stops at time n; then the discounted return is
U_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + … + γ^(n−t) R_n
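A minimal Python sketch of this formula, applied to a made-up list of observed rewards:

def discounted_return(rewards, gamma=0.99):
    # rewards[0] corresponds to R_t, rewards[1] to R_{t+1}, and so on
    u = 0.0
    for k, r in enumerate(rewards):
        u += (gamma ** k) * r
    return u

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71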
Randomness in Returns
• At time t, we have not yet observed the rewards R_t, …, R_n.
• R_t, …, R_n are unknown random variables and are denoted by upper-case letters.
• U_t is a sum of R_t, …, R_n and hence is itself an unknown random variable.
Randomness in Returns: Observed u_t
• Suppose the game has ended.
• At this point, we have observed all the rewards.
• The rewards are therefore denoted by lower-case letters.
• The sum of all the observed rewards gives the return u_t, which is an observed value.
• u_t is just a number; it has no randomness.
Examples of Reinforcement Learning
Cart-Pole Balancing (a runnable sketch follows below)
• Goal: balance the pole on top of a moving cart.
• State: pole angle, angular speed, cart position, horizontal velocity.
• Actions: horizontal force applied to the cart.
• Reward: +1 at each time step if the pole is upright.
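A hedged, runnable sketch of this setup using the open-source gymnasium package (assumed to be installed; the CartPole-v1 environment and its API come from that library, not from the slides). The agent here just acts randomly.

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()        # obs = [cart position, cart velocity, pole angle, pole angular velocity]
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()    # random policy: push cart left (0) or right (1)
    obs, reward, terminated, truncated, info = env.step(action)   # +1 reward per step while upright
    total_reward += reward
    done = terminated or truncated
print("episode return:", total_reward)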
Examples of Reinforcement Learning
Doom*
• Goal: eliminate all opponents.
• State: raw game pixels.
• Actions: Up, Down, Left, Right, Shoot, etc.
• Reward: positive when eliminating an opponent, negative when the agent is eliminated.

* Added for important thought-provoking considerations of AI safety in the context of autonomous weapons systems (see AGI lectures on the topic).
Examples of Reinforcement Learning
Grasping Objects with a Robotic Arm
• Goal: pick up objects of different shapes.
• State: raw pixels from the camera.
• Actions: move arm; grasp.
• Reward: positive when the pickup is successful.
Examples of Reinforcement Learning
Human Life
• Goal: survival? Happiness?
• State: sight, hearing, taste, smell, touch.
• Actions: think, move.
• Reward: homeostasis?
3 Types of Reinforcement Learning
Model-based:
• Learn the model of the world, then plan using the model
• Update the model often
• Re-plan often
Value-based:
• Learn the state or state-action value
• Act by choosing the best action in a state
• Exploration is a necessary add-on
Policy-based:
• Learn the stochastic policy function that maps state to action
• Act by sampling the policy
• Exploration is baked in
Taxonomy of RL Methods
Value Functions
• Action-Value Function
• State-Value Function
Action-Value Function
• U_t is the sum of all the future rewards.
• The agent’s goal is to maximize U_t (collect coins and avoid enemies).
• U_t could be used to evaluate the current situation, but at time t the return U_t is a random variable whose value is unknown.
• Then how can we use it to evaluate the current situation?
• The solution is to take the expectation of U_t.
Action-Value Function
• By integrating out the randomness in U_t, we obtain a real number that reflects how good the current situation is (whether Mario is winning or losing).
• Let’s denote this real number by Q_π(s_t, a_t); this is the action-value function.
• It depends on the current state and the current action.
• Taking the expectation of U_t eliminates the randomness in U_t.
• The randomness in U_t is due to all the states and actions from time t onward.
Action-Value Function
• Treat the current state and the current action as observed values; pretend they are not random.
• The action-value function is a conditional expectation:
Q_π(s_t, a_t) = E[U_t | S_t = s_t, A_t = a_t]
• We take the expectation of U_t given the observed values of s_t and a_t.
• Besides s_t and a_t, U_t also depends on S_{t+1}, …, S_n and A_{t+1}, …, A_n.
• These are treated as random variables and are integrated out by the expectation.
Action-Value Function
• To compute the expectation, we need the probability density functions of S and A.
• The environment generates a new state by randomly sampling from the state transition function p(s′ | s, a); p is the PDF of the next state.
• The actions are randomly sampled from the policy function π.
• This means the resulting expectation depends on π: if the policy function π changes, the outcome of the expectation will be different.
• Hence the action-value function Q_π depends on the policy function π.
Action-Value Function
• With different policy functions, Q_π will be different.
• With a better policy function π, Q_π becomes larger.
• Q_π depends on the current state and current action, because we treat them as observed values; they are not eliminated by the expectation.
• Q_π tells us, given the current state, how good it is to take the current action (a Monte Carlo sketch of estimating it follows below).
• Q_π can also be considered a critic, because it tells how good the agent’s performance is.
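Since Q_π(s, a) is an expectation of the return, it can be approximated by Monte Carlo: average the discounted returns of many rollouts that start from (s, a) and then follow π. The sketch below assumes hypothetical env_step and sample_policy functions; it illustrates the idea rather than the slides’ exact method.

def mc_q_estimate(s, a, env_step, sample_policy, gamma=0.99, n_rollouts=100, horizon=200):
    total = 0.0
    for _ in range(n_rollouts):
        state, action = s, a
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            state, reward, done = env_step(state, action)   # environment randomness
            ret += discount * reward
            discount *= gamma
            if done:
                break
            action = sample_policy(state)                   # policy randomness
        total += ret
    return total / n_rollouts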
Q-Learning
• State-action value function: Q_π(s, a)
• Expected return when starting in s, performing a, and following π
• Q-Learning: use any policy to estimate Q that maximizes future reward:
Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
where α is the learning rate, γ is the discount factor, s_t is the old state, s_{t+1} is the new state, and r_{t+1} is the reward.
• Q directly approximates Q* (Bellman optimality equation)
• Independent of the policy being followed
• Only requirement: keep updating each (s, a) pair (a tabular sketch follows below)
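A minimal tabular Python sketch of this update rule. The environment interface (reset and step returning integer states, a reward, and a done flag) is hypothetical; epsilon-greedy exploration is added so that every (s, a) pair keeps getting updated.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:        # explore
                a = np.random.randint(n_actions)
            else:                                 # exploit
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q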
Q-Learning: Value Iteration
Example Q-table (rows are states S1–S4, columns are actions A1–A4):
      A1   A2   A3   A4
S1    +1   +2   -1    0
S2    +2    0   +1   -2
S3    -1   +1    0   -2
S4    -2    0   +1   +1
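Acting greedily with respect to this table picks, for each state, the action with the largest Q-value. The values below are copied from the table above; numpy is assumed to be available.

import numpy as np

Q = np.array([
    [ 1,  2, -1,  0],   # S1
    [ 2,  0,  1, -2],   # S2
    [-1,  1,  0, -2],   # S3
    [-2,  0,  1,  1],   # S4
])

greedy = np.argmax(Q, axis=1)
for i, a in enumerate(greedy, start=1):
    print("S%d: choose A%d" % (i, a + 1))
# S1: A2, S2: A1, S3: A2, S4: A3 (ties broken toward the lower-numbered action)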
