
Reinforcement Learning

Machine Learning: Supervised vs Reinforcement
Supervised learning is “teach by example”: here are some examples, now learn the patterns in these examples.
Reinforcement learning is “teach by experience”: here is a world, now learn patterns by exploring it.

Reinforcement Learning in Humans
• Humans appear to learn to walk through “very few examples” of trial and error. How they do this is an open question…
• Possible answers:
• Hardware: 230 million years of bipedal movement data.
• Imitation Learning: observation of other humans walking.
• Algorithms: something better than backpropagation and stochastic gradient descent.
The AI stack: Environment → Sensors → Sensor Data → Feature Extraction → Representation → Machine Learning → Knowledge → Reasoning → Planning → Action → Effector.
Open question: what can be learned from data?
Example sensors feeding the stack: GPS, camera (visible, infrared), radar, lidar, stereo camera, microphone, IMU, and networking (wired, wireless).
The duck test across the stack. Image recognition: if it looks like a duck. Audio recognition: quacks like a duck. Activity recognition: swims like a duck.
Final breakthrough, 358 years after its conjecture:
“It was so indescribably beautiful; it was so simple and so elegant. I couldn’t understand how I’d missed it and I just stared at it in disbelief for twenty minutes. Then during the day I walked around the department, and I’d keep coming back to my desk looking to see if it was still there. It was still there. I couldn’t contain myself, I was so excited. It was the most important moment of my working life. Nothing I ever do again will mean as much.”
The promise of Deep Learning: learning the lower layers of the stack (sensor data, feature extraction, representation, machine learning).
The promise of Deep Reinforcement Learning: extending this up through knowledge, reasoning, planning, and action.
Terminologies
• Agent
• State
• Action
• Policy
• Reward
• State Transition
Reinforcement Learning Framework
At each step, the agent:
• Executes an action
• Observes the new state
• Receives a reward
(A minimal sketch of this loop follows below.)
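The following is a minimal Python sketch of this interaction loop. The names env and agent (and their reset, step, and act methods) are hypothetical stand-ins for any environment and policy implementation, not code from the slides.

def run_episode(env, agent, max_steps=1000):
    state = env.reset()                        # initial state from the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)              # execute an action
        state, reward, done = env.step(action) # observe new state, receive reward
        total_reward += reward
        if done:                               # terminal state reached
            break
    return total_reward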
Environment and Actions
• Fully Observable (Chess) vs Partially Observable (Poker)
• Single Agent (Atari) vs Multi-Agent (DeepTraffic)
• Deterministic (Cart Pole) vs Stochastic (DeepTraffic)
• Static (Chess) vs Dynamic (DeepTraffic)
• Discrete (Chess) vs Continuous (Cart Pole)
Major Components of an RL Agent
An RL agent may be directly or indirectly trying to learn a:
• Policy: agent’s behavior function
• Value function: how good is each state and/or action
• Model: agent’s representation of the environment

An episode is a trajectory of states, actions, and rewards:
s_0, a_0, r_1, s_1, a_1, r_2, …, s_{n−1}, a_{n−1}, r_n, s_n
where s_n is the terminal state.
Policy
• In everyday usage, “policy” refers to the rules, regulations, or services of an organization.
• In RL, a policy means making a decision based on the current state.
• While playing a game, we observe the state on the screen and make a decision accordingly.
• Here the policy is to avoid enemies and to collect coins.
Policy
• Formally, the policy function can be defined as
π(a | s) = P(A = a | S = s)
• The policy function π maps a state-action pair to a probability score between 0 and 1.
• π is a conditional probability density function.
• It is the probability of taking action a while observing state s (a toy sketch follows below).
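As a toy illustration of this definition (not from the slides), a policy over a small discrete problem can be written as a table of probabilities; the state names, action names, and numbers below are made up.

POLICY_TABLE = {
    "state_1": {"left": 0.2, "right": 0.1, "up": 0.7},
    "state_2": {"left": 0.5, "right": 0.4, "up": 0.1},
}

def pi(action, state):
    # probability of taking the given action while observing the given state;
    # the values for all actions in a given state sum to 1
    return POLICY_TABLE[state][action]

print(pi("up", "state_1"))   # 0.7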
Policy
• For example, while observing a specific game state, the agent can take one of three actions. If we feed this state to the policy function π, it will output three probability scores.
• The probabilities output by the policy function guide the agent’s decisions.
• With these probabilities at hand, let’s perform random sampling.
Policy
• Any of the three actions may be chosen, but moving up has the highest probability.
• Decisions are made by the policy function.
• How to learn the policy function is the main theme of RL.
• In the current example, the agent’s actions are random: they are sampled according to the probabilities output by the policy function (see the sampling sketch below).
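The random sampling step can be sketched as follows; the action names and probability values are made up for illustration, and numpy is assumed to be available.

import numpy as np

actions = ["left", "right", "up"]
probs = [0.2, 0.1, 0.7]                      # hypothetical output of pi(a | s)

action = np.random.choice(actions, p=probs)  # random sampling
print(action)                                # "up" is most likely, but any action can occur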
Reward
• The goal of reinforcement learning is to maximize the cumulative reward.
• The choice of reward affects the policy.
• For example, if the reward for winning the game is not higher than the reward for collecting a coin, then Mario will prefer to collect coins instead of winning the game.
State Transition
• State transitions can be deterministic or random.
• Randomness in the transition comes from the environment.
• In our case, the environment is the game: the game’s program determines the next state.
• The movement of the enemy (the Goomba) is random and is determined by the environment; it is not in our control.
• Because of the Goomba’s randomness, the next state is random.
• We can denote the state transition by a function p(s′ | s, a), where s′ is the next state.
• In practice we don’t have the state transition function because of the randomness (like the Goomba in our case); only the environment has this function (a toy illustration follows below).
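A toy sketch (not the game’s actual program) of a stochastic transition: given the same state and action, the next state can differ because of randomness inside the environment. The state fields and movement rules are made up.

import random

def step(state, action):
    goomba_move = random.choice([-1, 0, +1])            # randomness from the environment
    return {
        "mario_x": state["mario_x"] + (1 if action == "right" else 0),
        "goomba_x": state["goomba_x"] + goomba_move,
    }

s = {"mario_x": 0, "goomba_x": 5}
print(step(s, "right"))   # calling this twice can give different next states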
Rewards and Returns
• Return (aka cumulative future reward):
U_t = R_t + R_{t+1} + R_{t+2} + … + R_n
• U_t is the return at time t.
• U_t is the sum of all the future rewards (from time t until the end of the game).
• The discounted return is more popular than the return defined above.
Discounted Returns
• Given the choice between receiving a reward now and being promised the same reward in the future, taking the immediate reward is the obvious choice, because the future is full of uncertainty.
• This implies that future rewards should be given lower weights.
Discounted Returns
• As future rewards are less important, they should be discounted:
U_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + …
• If a future reward is as important as the current reward, set the discount factor γ to 1.
• If future rewards are less important, set γ to a lower number.
• Note that in the equation the current reward is not discounted, but the future rewards are.
Discounted Returns
• The discounted return is the weighted sum of the rewards from time t until the end of the game (a small computation sketch follows below).
• Suppose the game stops at time n; then the discounted return is
U_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + … + γ^(n−t) R_n
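A minimal Python sketch of this formula, applied to a made-up list of observed rewards:

def discounted_return(rewards, gamma=0.99):
    # rewards[0] corresponds to R_t, rewards[1] to R_{t+1}, and so on
    u = 0.0
    for k, r in enumerate(rewards):
        u += (gamma ** k) * r
    return u

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71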
Randomness in Returns
• At time t, we have not yet observed the rewards R_t, …, R_n.
• R_t, …, R_n are unknown random variables and are denoted by upper-case letters.
• U_t is a sum of R_t, …, R_n and hence is itself an unknown random variable.
Randomness in Returns: Observed u_t
• Suppose the game has ended.
• At this point, we have observed all the rewards.
• The rewards are therefore denoted by lower-case letters.
• The sum of all the observed rewards gives the return u_t, which is an observed value.
• u_t is just a number; it has no randomness.
Examples of Reinforcement Learning
Cart-Pole Balancing (a runnable sketch follows below)
• Goal: balance the pole on top of a moving cart.
• State: pole angle, angular speed, cart position, horizontal velocity.
• Actions: horizontal force applied to the cart.
• Reward: +1 at each time step if the pole is upright.
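A hedged, runnable sketch of this setup using the open-source gymnasium package (assumed to be installed; the CartPole-v1 environment and its API come from that library, not from the slides). The agent here just acts randomly.

import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()        # obs = [cart position, cart velocity, pole angle, pole angular velocity]
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()    # random policy: push cart left (0) or right (1)
    obs, reward, terminated, truncated, info = env.step(action)   # +1 reward per step while upright
    total_reward += reward
    done = terminated or truncated
print("episode return:", total_reward)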
Examples of Reinforcement Learning
Doom*
• Goal: eliminate all opponents.
• State: raw game pixels.
• Actions: Up, Down, Left, Right, Shoot, etc.
• Reward: positive when eliminating an opponent, negative when the agent is eliminated.

* Added for important thought-provoking considerations of AI safety in the context of autonomous weapons systems (see AGI lectures on the topic).
Examples of Reinforcement Learning
Grasping Objects with a Robotic Arm
• Goal: pick up objects of different shapes.
• State: raw pixels from the camera.
• Actions: move arm; grasp.
• Reward: positive when the pickup is successful.
Examples of Reinforcement Learning
Human Life
• Goal: survival? Happiness?
• State: sight, hearing, taste, smell, touch.
• Actions: think, move.
• Reward: homeostasis?
3 Types of Reinforcement Learning
Model-based:
• Learn the model of the world, then plan using the model
• Update the model often
• Re-plan often
Value-based:
• Learn the state or state-action value
• Act by choosing the best action in a state
• Exploration is a necessary add-on
Policy-based:
• Learn the stochastic policy function that maps state to action
• Act by sampling the policy
• Exploration is baked in
Taxonomy of RL Methods
Value Functions
• Action-Value Function
• State-Value Function
Action-Value Function
• U_t is the sum of all the future rewards.
• The agent’s goal is to maximize U_t (collect coins and avoid enemies).
• U_t could be used to evaluate the current situation, but at time t the return U_t is a random variable whose value is unknown.
• Then how can we use it to evaluate the current situation?
• The solution is to take the expectation of U_t.
Action-Value Function
• By integrating out the randomness in U_t, we obtain a real number that reflects how good the current situation is (whether Mario is winning or losing).
• Let’s denote this real number by Q_π(s_t, a_t); this is the action-value function.
• It depends on the current state and the current action.
• Taking the expectation of U_t eliminates the randomness in U_t.
• The randomness in U_t is due to all the states and actions from time t onward.
Action-Value Function
• Treat the current state and the current action as observed values; pretend they are not random.
• The action-value function is a conditional expectation:
Q_π(s_t, a_t) = E[U_t | S_t = s_t, A_t = a_t]
• We take the expectation of U_t given the observed values of s_t and a_t.
• Besides s_t and a_t, U_t also depends on S_{t+1}, …, S_n and A_{t+1}, …, A_n.
• These are treated as random variables and are integrated out by the expectation.
Action-Value Function
• To compute the expectation, we need the probability density functions of S and A.
• The environment generates a new state by randomly sampling from the state transition function p(s′ | s, a); p is the PDF of the next state.
• The actions are randomly sampled from the policy function π.
• This means the resulting expectation depends on π: if the policy function π changes, the outcome of the expectation will be different.
• Hence the action-value function Q_π depends on the policy function π.
Action-Value Function
• With different policy functions, Q_π will be different.
• With a better policy function π, Q_π becomes larger.
• Q_π depends on the current state and current action, because we treat them as observed values; they are not eliminated by the expectation.
• Q_π tells us, given the current state, how good it is to take the current action (a Monte Carlo sketch of estimating it follows below).
• Q_π can also be considered a critic, because it tells how good the agent’s performance is.
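Since Q_π(s, a) is an expectation of the return, it can be approximated by Monte Carlo: average the discounted returns of many rollouts that start from (s, a) and then follow π. The sketch below assumes hypothetical env_step and sample_policy functions; it illustrates the idea rather than the slides’ exact method.

def mc_q_estimate(s, a, env_step, sample_policy, gamma=0.99, n_rollouts=100, horizon=200):
    total = 0.0
    for _ in range(n_rollouts):
        state, action = s, a
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            state, reward, done = env_step(state, action)   # environment randomness
            ret += discount * reward
            discount *= gamma
            if done:
                break
            action = sample_policy(state)                   # policy randomness
        total += ret
    return total / n_rollouts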
Q-Learning
• State-action value function: Q_π(s, a)
• Expected return when starting in s, performing a, and following π
• Q-Learning: use any policy to estimate Q that maximizes future reward:
Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
where α is the learning rate, γ is the discount factor, s_t is the old state, s_{t+1} is the new state, and r_{t+1} is the reward.
• Q directly approximates Q* (Bellman optimality equation)
• Independent of the policy being followed
• Only requirement: keep updating each (s, a) pair (a tabular sketch follows below)
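A minimal tabular Python sketch of this update rule. The environment interface (reset and step returning integer states, a reward, and a done flag) is hypothetical; epsilon-greedy exploration is added so that every (s, a) pair keeps getting updated.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:        # explore
                a = np.random.randint(n_actions)
            else:                                 # exploit
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q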
Q-Learning: Value Iteration
Example Q-table (rows are states S1–S4, columns are actions A1–A4):
      A1   A2   A3   A4
S1    +1   +2   -1    0
S2    +2    0   +1   -2
S3    -1   +1    0   -2
S4    -2    0   +1   +1
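Acting greedily with respect to this table picks, for each state, the action with the largest Q-value. The values below are copied from the table above; numpy is assumed to be available.

import numpy as np

Q = np.array([
    [ 1,  2, -1,  0],   # S1
    [ 2,  0,  1, -2],   # S2
    [-1,  1,  0, -2],   # S3
    [-2,  0,  1,  1],   # S4
])

greedy = np.argmax(Q, axis=1)
for i, a in enumerate(greedy, start=1):
    print("S%d: choose A%d" % (i, a + 1))
# S1: A2, S2: A1, S3: A2, S4: A3 (ties broken toward the lower-numbered action)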
