7- Reinforcement Learning
• Robot soccer
• Investing in shares
• An MDP is specified by a set of states S, a set of actions A, a reward function R(s, a), and a transition model T(s, a, s'); r(s, a, s') denotes the reward received when action a in state s leads to the next state s'.
• A policy is a mapping from states to actions, π : S → A.
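A minimal Python sketch of these components is given below; the container types and field names are illustrative assumptions, not something defined in these notes.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    states: List[str]                             # S: the set of states
    actions: Dict[str, List[str]]                 # A(s): actions available in each state
    transition: Callable[[str, str, str], float]  # T(s, a, s'): probability of reaching s'
    reward: Callable[[str, str, str], float]      # r(s, a, s'): reward for that transition
    gamma: float = 0.9                            # discount factor (illustrative default)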
MDP - Example I
• Consider the graph below, and find the shortest path from the start node S to the goal node G.
• Set of states: {S, T, U, V}
• Action – Traversal from one state to another state along an edge
• Reward – Traversing an edge provides its "edge length" in dollars
• Policy – The path followed to reach the destination, e.g. {S → T → V}
[Figure: graph with nodes S, T, U, V and goal node G; the edge weights shown include 14, 51, 25, 15, -22, and -5.]
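To make the mapping concrete, the sketch below encodes a shortest-path problem of this kind in Python. The edge set and the cost values are hypothetical placeholders, since the figure above is not reproduced in full.

# Hypothetical edge list: (state, next_state) -> edge "length" in dollars.
# The actual edges and values come from the figure and are assumed here.
edges = {
    ("S", "T"): 14, ("S", "U"): 25,
    ("T", "V"): -22, ("T", "G"): 51,
    ("U", "V"): 15, ("V", "G"): -5,
}

states = {"S", "T", "U", "V", "G"}   # states, with G as the goal

def actions(s):
    """Traversals available from state s."""
    return [t for (u, t) in edges if u == s]

def reward(s, t):
    """Dollars received for traversing the edge (s, t)."""
    return edges[(s, t)]

# A policy is then a choice of path, e.g. S -> T -> V -> G.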
Q-Learning
• Q-Learning is a value-based reinforcement learning algorithm that uses Q-values (action values) to iteratively improve the behavior of the learning agent.
• The goal is to maximize the Q-value and thereby find the optimal action-selection policy.
• The Q-table helps to find the best action for each state and maximize the expected reward (a minimal sketch of a Q-table appears after this list).
• Q-Values / Action-Values: Q-values are defined for state-action pairs.
• Q(s, a) denotes an estimate of how good it is to take action a in state s.
• This estimate of Q(s, a) is computed iteratively using the TD-Update rule.
• Reward: At every transition, the agent observes a reward from the environment for the action taken, and then transitions to another state.
• Episode: An episode is complete when the agent reaches one of the terminating states, i.e. a state from which no further transitions are possible.
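The following is a minimal sketch of a Q-table as a 2-D array (one row per state, one column per action); the sizes chosen here are illustrative assumptions.

import numpy as np

n_states, n_actions = 6, 6           # assumed sizes, for illustration only
Q = np.zeros((n_states, n_actions))  # Q[s, a]: estimated value of action a in state s

# The greedy choice in a given state is the action with the largest Q-value.
best_action = int(np.argmax(Q[2]))   # best action from state 2 (ties broken arbitrarily)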
Q-Learning
• Initially, the agent explores the environment and updates the Q-table. Once the Q-table is ready, the agent starts to exploit the environment and take better actions.
• Q-learning is an off-policy control algorithm, i.e. the policy being learned (the target policy) differs from the behavior policy used to select actions. It estimates the return of future actions and updates the value of the new state without requiring the behavior policy to be greedy.
Temporal Difference or TD-Update:
• The estimate of Q is updated at every time step of the agent's interaction with the environment.
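In its standard form, the TD update uses a learning rate Alpha (0 < Alpha ≤ 1):

Q(s, a) ← Q(s, a) + Alpha * [ r + Gamma * max[Q(s', all actions)] − Q(s, a) ]

Setting Alpha = 1 recovers the simplified update rule used later in these notes, Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)].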
Advantage:
• Converges to an optimal policy in both deterministic and nondeterministic
MDPs.
Disadvantage:
• Tabular Q-learning is only practical for small problems (small state-action spaces).
Understanding Q-Learning
• The environment is a building with 5 rooms connected by doors.
• The rooms are numbered 0 to 4, and the area outside the building is numbered 5.
• Doors from rooms 1 and 4 lead outside the building (to 5).
• Problem: The agent can be placed in any one of the rooms (0, 1, 2, 3, 4). The agent's goal is to reach the outside of the building (state 5).
Understanding Q-Learning
• Represent the rooms as a graph: each room number is a state, and each door is an edge.
Understanding Q-Learning
• Assign a reward value to each door.
• Doors that lead immediately to the target are assigned an instant reward of 100.
• Other doors, not directly connected to the target room, have zero reward.
• Because doors are two-way (for example, 0 leads to 4, and 4 leads back to 0), two edges are assigned between each pair of connected rooms.
• Each edge carries an instant reward value (see the reward-matrix sketch below).
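A possible reward matrix R for this environment is sketched below in Python. The exact room connectivity comes from the figure, which is not reproduced in these notes, so the door layout used here (0–4, 1–3, 1–5, 2–3, 3–4, 4–5) is an assumption.

import numpy as np

# Rows = current state (room), columns = action (the room moved to).
#  100 -> door leading directly to the goal (state 5)
#    0 -> door exists but does not lead directly to the goal
#   -1 -> no door between the two rooms (invalid action)
R = np.array([
    #  0    1    2    3    4    5
    [ -1,  -1,  -1,  -1,   0,  -1],   # room 0: door to 4
    [ -1,  -1,  -1,   0,  -1, 100],   # room 1: doors to 3 and outside (5)
    [ -1,  -1,  -1,   0,  -1,  -1],   # room 2: door to 3
    [ -1,   0,   0,  -1,   0,  -1],   # room 3: doors to 1, 2 and 4
    [  0,  -1,  -1,   0,  -1, 100],   # room 4: doors to 0, 3 and outside (5)
    [ -1,   0,  -1,  -1,   0, 100],   # state 5 (outside): the goal
])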
Understanding Q-Learning
• Consider an agent that starts from state (room) 2.
• The agent's movement from one state to another is an action a.
• The agent traverses from state 2 to state 5 (the target):
– Initial state = current state, i.e. state 2
– Transition: state 2 → state 3
– Transition: state 3 → state 2, 1, or 4
– Transition: state 4 → state 5
Understanding Q-Learning: Prepare matrix Q
• Matrix Q is the memory of the agent, in which information learned from experience is stored.
• Each row denotes the current state of the agent.
• Each column denotes a possible action leading to the next state.
Compute Q matrix:
Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Gamma is the discount factor for future rewards. Its range is 0 to 1, i.e. 0 < Gamma < 1.
• Future rewards are less valuable than current rewards, so they must be discounted.
• If Gamma is closer to 0, the agent will tend to consider only immediate rewards.
• If Gamma is closer to 1, the agent will give greater weight to future rewards.
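As a worked instance of this formula, assume Gamma = 0.8, the reward matrix sketched earlier, and a Q matrix still initialized to zero. Updating the entry for moving from room 1 through its door to the goal state 5 gives:

Q(1, 5) = R(1, 5) + 0.8 * max[Q(5, 1), Q(5, 4), Q(5, 5)]
        = 100 + 0.8 * 0
        = 100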
Q-Learning Algorithm
• Set the gamma parameter
• Set environment rewards in matrix R
• Initialize matrix Q as Zero
– Select random initial (source) state
• Set initial state s = current state
– Select one action a among all possible actions using an exploratory policy
• Take this action a, moving to the next state s'
• Observe reward r
– Get the maximum Q value of the next state s' over all possible actions
• Compute:
– Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
• Repeat the above steps until the goal state is reached, i.e. current state = goal state
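A minimal Python sketch of this algorithm, applied to the room environment and the assumed reward matrix R from earlier, might look like the following; Gamma, the number of episodes, and the random seed are illustrative choices.

import numpy as np

# Reward matrix from the earlier sketch (assumed connectivity).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])

GAMMA = 0.8                          # discount factor
GOAL = 5                             # goal state: outside the building
EPISODES = 1000                      # number of training episodes

Q = np.zeros_like(R, dtype=float)    # initialize matrix Q as zero
rng = np.random.default_rng(0)

for _ in range(EPISODES):
    state = int(rng.integers(0, 5))                 # random initial room (0-4)
    while state != GOAL:                            # repeat until the goal is reached
        valid = np.where(R[state] >= 0)[0]          # actions allowed from this state
        action = int(rng.choice(valid))             # exploratory (random) action
        next_state = action                         # here the action is the room moved to
        # Q(state, action) = R(state, action) + Gamma * max[Q(next state, all actions)]
        Q[state, action] = R[state, action] + GAMMA * Q[next_state].max()
        state = next_state

# After training, following the largest Q-value in each row leads to the goal.
print((Q / Q.max() * 100).round())                  # normalized Q matrix

With these assumptions, the largest entry in each row of the learned Q matrix points along a path toward state 5.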
Example: Q-Learning
[Table: matrix with states 0–5 as rows and actions 0–5 as columns; the cell values were not recovered from the original slide.]