Lecture – 29
DISCOVER . LEARN . EMPOWER
Reinforcement Learning
COURSE OBJECTIVES
The course aims to:
•Understand and apply various data handling and visualization
techniques.
•Understand basic learning algorithms and techniques and their
applications, as well as general questions related to analysing and
handling large data sets.
•Develop skills in supervised and unsupervised learning techniques
and implement them to solve real-life problems.
•Develop basic knowledge of machine learning techniques for building
intelligent machines that can make decisions on behalf of humans.
•Develop skills for selecting suitable model parameters and apply them
to design optimized machine learning applications.
COURSE OUTCOMES
On completion of this course, the students shall be able to:
CO2 Understand data pre-processing techniques and apply these for data cleaning.
CO3 Identify and implement simple learning strategies using data science and statistics principles.
Unit-3 Syllabus
SUGGESTIVE READINGS
TEXT BOOKS:
T1: Tom M. Mitchell, "Machine Learning", McGraw Hill International Edition.
T2: Ethem Alpaydin, "Introduction to Machine Learning", Eastern Economy Edition, Prentice Hall of India, 2005.
T3: Andreas C. Müller, Sarah Guido, "Introduction to Machine Learning with Python", O'Reilly, 2016.
REFERENCE BOOKS:
R1: Sebastian Raschka, Vahid Mirjalili, "Python Machine Learning", Packt.
R2: Richard O. Duda, Peter E. Hart, David G. Stork, "Pattern Classification", Wiley, 2nd Edition.
R3: Christopher Bishop, "Pattern Recognition and Machine Learning", illustrated Edition, Springer, 2006.
What is learning?
Learning takes place as a result of interaction
between an agent and the world. The idea
behind learning is that
percepts received by an agent should be used not
only for acting, but also for improving the agent's
ability to behave optimally in the future to achieve
its goal.
Learning types
Supervised learning:
a situation in which sample (input, output) pairs of the
function to be learned can be perceived or are given
You can think of it as learning with a kind teacher
Reinforcement learning:
the agent acts on its environment and receives some
evaluation of its action (a reinforcement), but is not told
which action is the correct one for achieving its goal
Reinforcement learning
Task
Learn how to behave successfully to achieve a
goal while interacting with an external
environment
Learn via experiences!
Examples
Game playing: the player knows whether it won or
lost, but not how it should have moved at each step
Control: a traffic system can measure the delay of
cars, but does not know how to decrease it.
RL is learning from interaction
RL model
Each percept (e) is enough to determine the
state (the state is accessible)
The agent can decompose the reward
component from a percept.
The agent's task: to find an optimal policy, mapping
states to actions, that maximizes a long-run measure
of the reinforcement
Think of reinforcement as reward
Can be modeled as an MDP!
Review of MDP model
MDP model <S, A, T, R>
• S – set of states
• A – set of actions
• T(s,a,s') = P(s'|s,a) – the probability of transition from s to s' given action a
• R(s,a) – the expected reward for taking action a in state s:
  R(s,a) = Σ_{s'} P(s'|s,a) r(s,a,s') = Σ_{s'} T(s,a,s') r(s,a,s')
[Figure: the agent observes the state and reward from the environment and sends back an action; repeated interaction yields a trajectory s0 –a0, r0→ s1 –a1, r1→ s2 –a2, r2→ s3 …]
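As a concrete illustration of this tuple, here is a minimal Python sketch of one way the model could be stored; the two states, two actions, and all numbers below are hypothetical, chosen only to show the data layout and the expected-reward formula.

```python
# A minimal MDP container: states S, actions A, transitions T(s,a,s') and rewards r(s,a,s').
# The two-state example is hypothetical, used only to show the data layout.
S = ["s0", "s1"]
A = ["stay", "move"]

# T[(s, a)] maps each successor s' to P(s'|s, a); each row sums to 1.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# r[(s, a, s')] is the immediate reward; R(s,a) is its expectation under T.
r = {("s0", "move", "s1"): 1.0}

def R(s, a):
    """Expected reward: R(s,a) = sum_{s'} T(s,a,s') * r(s,a,s')."""
    return sum(p * r.get((s, a, s2), 0.0) for s2, p in T[(s, a)].items())

print(R("s0", "move"))  # 0.8 * 1.0 = 0.8
```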
Model-based vs. model-free
approaches
But we don't know anything about the environment
model, i.e. the transition function T(s,a,s')
Here come two approaches
Model-based RL:
learn the model, and use it to derive the optimal policy,
e.g. the adaptive dynamic programming (ADP) approach
Adaptive dynamic programming (ADP)
in passive learning
Different from the LMS and TD methods (model-free
approaches),
ADP is a model-based approach!
The updating rule for passive learning:
U(s) ← Σ_{s'} T(s,s') (r(s,s') + γ U(s'))
(γ is the discount factor)
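A minimal sketch of this passive update in Python, assuming the (already learned) model T(s,s') and rewards r(s,s') are held in dictionaries and a discount factor gamma is used; the two-state chain at the bottom is a made-up example, not the grid world from the slides.

```python
GAMMA = 0.9  # discount factor (assumed)

def adp_passive_update(U, T, r):
    """One sweep of U(s) <- sum_{s'} T(s,s') * (r(s,s') + gamma * U(s'))."""
    new_U = {}
    for s, successors in T.items():
        new_U[s] = sum(p * (r.get((s, s2), 0.0) + GAMMA * U.get(s2, 0.0))
                       for s2, p in successors.items())
    # states with no recorded successors keep their old utilities
    for s in U:
        new_U.setdefault(s, U[s])
    return new_U

# illustrative two-state chain under a fixed policy
T = {"a": {"b": 1.0}, "b": {"b": 1.0}}
r = {("a", "b"): 0.0, ("b", "b"): 1.0}
U = {"a": 0.0, "b": 0.0}
for _ in range(50):
    U = adp_passive_update(U, T, r)
print(U)  # U(b) approaches 1/(1-gamma) = 10, U(a) approaches gamma * U(b)
```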
Active learning
An active agent must consider
what actions to take?
what their outcomes may be (both for learning and for
receiving rewards in the long run)?
Utility update equation
U(s) ← max_a ( R(s,a) + γ Σ_{s'} T(s,a,s') U(s') )
Rule to choose the action
a = argmax_a ( R(s,a) + γ Σ_{s'} T(s,a,s') U(s') )
Active ADP algorithm
For each s, initialize U(s), T(s,a,s') and R(s,a)
Initialize s to the current state that is perceived
Loop forever
{
Select an action a and execute it (using the current model R and T) using
a = argmax_a ( R(s,a) + γ Σ_{s'} T(s,a,s') U(s') )
Receive the immediate reward r and observe the new state s'
Use the transition tuple <s, a, s', r> to update the model R and T (next slide)
For all states, update U(s) using the utility update equation
s = s'
}
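A minimal Python sketch of the two pieces used inside this loop, the greedy action rule and the full utility sweep, assuming a discount factor gamma and the same dictionary layout as the earlier sketches; interacting with the environment and updating the model are left out (see the next slide).

```python
GAMMA = 0.9  # discount factor (assumed)

def q_estimate(s, a, U, T, R):
    """R(s,a) + gamma * sum_{s'} T(s,a,s') * U(s') under the current model."""
    return R.get((s, a), 0.0) + GAMMA * sum(
        p * U.get(s2, 0.0) for s2, p in T.get((s, a), {}).items())

def greedy_action(s, A, U, T, R):
    """a = argmax_a [ R(s,a) + gamma * sum_{s'} T(s,a,s') U(s') ]"""
    return max(A, key=lambda a: q_estimate(s, a, U, T, R))

def utility_sweep(S, A, U, T, R):
    """U(s) <- max_a [ R(s,a) + gamma * sum_{s'} T(s,a,s') U(s') ] for all states."""
    return {s: max(q_estimate(s, a, U, T, R) for a in A) for s in S}
```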
How to learn the model?
Use the transition tuple <s, a, s', r> to learn T(s,a,s') and
R(s,a). That's supervised learning!
Since the agent observes every transition (s, a, s', r) directly, it can take
(s,a)/s' as an input/output example of the transition probability
function T.
Different supervised learning techniques can be used (see further
reading for details)
Use r and T(s,a,s') to learn R(s,a):
R(s,a) = Σ_{s'} T(s,a,s') r(s,a,s')
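One simple way to do this, sketched below, is count-based maximum-likelihood estimation: T(s,a,s') is estimated from visit counts and R(s,a) as the running average of observed rewards (equivalent in expectation to the sum above). The dictionary-of-counts layout is an illustrative assumption, not something prescribed by the slides.

```python
from collections import defaultdict

N_sa = defaultdict(int)          # visits of (s, a)
N_sas = defaultdict(int)         # visits of (s, a, s')
reward_sum = defaultdict(float)  # accumulated reward observed for (s, a)

def observe(s, a, s2, r):
    """Update the model from one transition tuple <s, a, s', r>."""
    N_sa[(s, a)] += 1
    N_sas[(s, a, s2)] += 1
    reward_sum[(s, a)] += r

def T(s, a, s2):
    """Maximum-likelihood estimate T(s,a,s') = N(s,a,s') / N(s,a)."""
    return N_sas[(s, a, s2)] / N_sa[(s, a)] if N_sa[(s, a)] else 0.0

def R(s, a):
    """Average observed reward, an estimate of sum_{s'} T(s,a,s') r(s,a,s')."""
    return reward_sum[(s, a)] / N_sa[(s, a)] if N_sa[(s, a)] else 0.0
```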
ADP approach: pros and cons
Pros:
The ADP algorithm converges far faster than LMS and temporal-difference
(TD) learning, because it uses the information from the model
of the environment.
Cons:
Intractable for large state spaces
(in each step, U is updated for all states)
This can be improved by prioritized sweeping (see further reading for details)
Another model-free method:
TD-Q learning
Define the Q-value function:
U(s) = max_a Q(s,a)
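A minimal sketch of this relation with a hypothetical Q-table (the states, actions, and values below are made up for illustration):

```python
# Hypothetical Q-table: Q[(s, a)] holds the learned Q-value for each state-action pair.
Q = {
    ("s0", "left"): 0.1, ("s0", "right"): 0.7,
    ("s1", "left"): 0.4, ("s1", "right"): 0.2,
}
ACTIONS = ["left", "right"]

def U(s):
    """U(s) = max_a Q(s, a)"""
    return max(Q[(s, a)] for a in ACTIONS)

print(U("s0"))  # 0.7
```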