DSA5102 Lecture11
Li Qianxiao
Department of Mathematics
So far
We introduced two classes of machine learning problems
• Supervised Learning
• Unsupervised Learning
Planning
The Reward Hypothesis
All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Examples
• Studying and getting good grades
• Learning to play a new musical instrument
• Winning at chess
• Navigating a maze
• An infant learning to walk
The Basic Components
[Figure: agent-environment loop with labels Environment, Action, State, Interpreter, Reward, Agent]
Examples
Rewards:
• +10 for each can picked up
• -1 for each meter moved
• -1000 for running out of battery
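As a small worked illustration of how these rewards add up over one episode, the sketch below totals them for two hypothetical runs of the robot. The function name `episode_reward` and the example counts are assumptions made for illustration; only the three reward values come from the list above.

```python
# Reward values taken from the list above.
REWARD_PER_CAN = 10
REWARD_PER_METER = -1
REWARD_BATTERY_EMPTY = -1000


def episode_reward(cans_picked: int, meters_moved: int, battery_ran_out: bool) -> int:
    """Total (undiscounted) reward collected over one hypothetical episode."""
    total = REWARD_PER_CAN * cans_picked + REWARD_PER_METER * meters_moved
    if battery_ran_out:
        total += REWARD_BATTERY_EMPTY
    return total


# Hypothetical episodes: a careful run versus a greedy run that drains the battery.
careful = episode_reward(cans_picked=3, meters_moved=12, battery_ran_out=False)
greedy = episode_reward(cans_picked=8, meters_moved=40, battery_ran_out=True)
print(careful, greedy)
```

The greedy run picks up more cans but ends with a far lower total, which is the behaviour the large battery penalty is meant to discourage.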
Another Example
The Reinforcement Learning Problem
The RL problem can be posed as follows:
https://en.wikipedia.org/wiki/Markov_chain
Non-Markovian or Non-time-homogeneous Stochastic Processes
Example of non-Markovian process
• Drawing coins without replacement out of a bag containing 10 each of \$1, 50c and 10c coins. Let $X_t$ be the total value of the coins drawn up to time $t$
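One way to see why the running total is not Markovian: two histories can reach the same total after the same number of draws and still leave different coins in the bag. The sketch below checks this exactly with fractions. The two histories and the helper names are assumptions chosen for illustration.

```python
from fractions import Fraction

COINS_PER_TYPE = 10          # the bag starts with 10 coins of each type
COIN_VALUES = (100, 50, 10)  # values in cents: $1, 50c, 10c


def total_value(drawn):
    """Total value in cents of the coins drawn so far; drawn maps coin value -> count."""
    return sum(value * count for value, count in drawn.items())


def next_coin_distribution(drawn):
    """Exact probability of each coin type on the next draw, given what was drawn."""
    remaining = {v: COINS_PER_TYPE - drawn.get(v, 0) for v in COIN_VALUES}
    coins_left = sum(remaining.values())
    return {v: Fraction(n, coins_left) for v, n in remaining.items()}


# Two hypothetical histories of 9 draws each that reach the same total of $4.50.
history_a = {100: 4, 50: 0, 10: 5}  # four $1 coins and five 10c coins
history_b = {100: 0, 50: 9, 10: 0}  # nine 50c coins

assert total_value(history_a) == total_value(history_b) == 450
dist_a = next_coin_distribution(history_a)
dist_b = next_coin_distribution(history_b)
print(dist_a[50], dist_b[50])
```

Both histories have the same total at the same time step, yet the chance that the next coin is a 50c piece is 10/21 in the first case and 1/21 in the second. The current total alone therefore does not determine the distribution of the next total.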
Essential elements
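• Sequence of time steps: $t = 0, 1, 2, \dots$
• States: $S_t \in \mathcal{S}$
• Actions: $A_t \in \mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}(s)$ (union over all $s$)
• Rewards: $R_t \in \mathcal{R} \subset \mathbb{R}$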
State Evolution
[Figure: agent-environment loop with labels Agent, Action, Environment, State, Reward (Interpreter)]
Transition Probability
For Markov chains, we have the transition probability $p(s' \mid s) = \mathbb{P}(S_{t+1} = s' \mid S_t = s)$
• If then is empty
• If , such that , and has a can, then
• …
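A minimal sketch of how transition probabilities can be stored and used for a finite Markov chain. The 3-state matrix `P` is a hypothetical example, not the robot environment above, and `step_distribution` is an illustrative helper name.

```python
def step_distribution(mu, P):
    """One step of the chain: returns the row vector mu @ P as a list."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical 3-state transition matrix; P[i][j] = probability of moving from i to j.
P = [
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]

# Every row must be a probability distribution over next states.
for row in P:
    assert all(p >= 0 for p in row) and abs(sum(row) - 1.0) < 1e-12

mu0 = [1.0, 0.0, 0.0]            # start in state 0 with certainty
mu1 = step_distribution(mu0, P)  # distribution after one step
mu2 = step_distribution(mu1, P)  # distribution after two steps
print(mu1, mu2)
```

Because the chain is time-homogeneous, the same matrix is applied at every step, so the distribution after $t$ steps is the initial distribution multiplied by $P$ a total of $t$ times.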
Reward:
Charging Station
The “Decision” Aspect: The Policy
The only way the agent has control over this system is through the
choice of actions.
Deterministic policies:
Then we write $a = \pi(s)$, i.e. deterministic policies are functions $\pi : \mathcal{S} \to \mathcal{A}$
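A minimal sketch of the two kinds of policy, assuming a general policy is a probability distribution over actions in each state. The state and action names are hypothetical, loosely modeled on the can-collecting robot, and the helper names are illustrative.

```python
import random

# Hypothetical states and actions, loosely modeled on the can-collecting robot.
STATES = ("high_battery", "low_battery")
ACTIONS = ("search", "recharge")

# A deterministic policy is a function from states to actions; a dict is enough here.
deterministic_policy = {"high_battery": "search", "low_battery": "recharge"}

# A stochastic policy assigns each state a probability distribution over actions.
stochastic_policy = {
    "high_battery": {"search": 0.9, "recharge": 0.1},
    "low_battery": {"search": 0.2, "recharge": 0.8},
}


def as_stochastic(policy):
    """View a deterministic policy as a stochastic one with all mass on one action."""
    return {s: {a: 1.0 if a == policy[s] else 0.0 for a in ACTIONS} for s in STATES}


def sample_action(policy, state, rng):
    """Draw an action from a stochastic policy in the given state."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_action(as_stochastic(deterministic_policy), "low_battery", rng))
```

Sampling from the converted deterministic policy always returns the same action for a given state, which is what makes it a function of the state.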
The Goal of Choosing a Policy:
Returns
We want to maximize long-term rewards…
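A common way to make "long-term rewards" precise is the discounted return $G = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$ with a discount factor $0 \le \gamma \le 1$. The sketch below assumes this definition; the reward list and the function name `discounted_return` are hypothetical.

```python
def discounted_return(rewards, gamma=1.0):
    """Return G = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite reward list."""
    g = 0.0
    # Work backwards using the recursion G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1, 2, 3]  # hypothetical rewards from one trajectory
print(discounted_return(rewards))             # undiscounted sum
print(discounted_return(rewards, gamma=0.5))  # later rewards count for less
```

With $\gamma = 1$ the return is the plain sum of rewards; with $\gamma < 1$ rewards received later contribute less, which also keeps the sum finite for infinitely long trajectories with bounded rewards.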
[Figure: diagrams annotated with per-step rewards]
The Complexity of Dynamic Programming
Combining, we get
Exercise: derive this equation and show that there exists a unique
solution
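A numerical illustration of the uniqueness claim, not a proof. It assumes the equation in question is the Bellman equation for the value of a fixed policy, written in matrix form as $v = r + \gamma P v$ with $0 \le \gamma < 1$. The 2-state chain and all names below are hypothetical.

```python
def evaluate_policy(P, r, gamma, v0, tol=1e-12, max_sweeps=10_000):
    """Iterate v <- r + gamma * P v until successive iterates differ by less than tol."""
    v = list(v0)
    n = len(v)
    for _ in range(max_sweeps):
        new_v = [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


# Hypothetical 2-state chain induced by some fixed policy.
P = [[0.5, 0.5],
     [0.0, 1.0]]
r = [1.0, 0.0]   # expected one-step reward in each state
gamma = 0.9

v_from_zero = evaluate_policy(P, r, gamma, v0=[0.0, 0.0])
v_from_far = evaluate_policy(P, r, gamma, v0=[100.0, -50.0])
print(v_from_zero, v_from_far)
```

Solving the two equations by hand gives $v(1) = 0$ and $v(0) = 1 / 0.55$ for the second and first state respectively, and both starting points converge to that same vector. This is consistent with the map $v \mapsto r + \gamma P v$ being a contraction when $\gamma < 1$.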
Comparing Policies
We can compare policies via their values
• Given policies $\pi$ and $\pi'$, we say $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$
• This is a partial order
Examples
• , , Then
• , , Then neither nor holds
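A short sketch of this comparison, assuming each policy is summarized by its list of state values in a fixed state order. The value vectors and the function name `compare_policies` are hypothetical.

```python
def compare_policies(v_a, v_b):
    """Compare two policies through their state values, listed in the same state order.

    Returns "==" if the values agree in every state, ">=" if a is at least as
    good as b in every state (and better somewhere), "<=" for the reverse,
    and "incomparable" if each policy is strictly better in some state.
    """
    a_geq_b = all(x >= y for x, y in zip(v_a, v_b))
    b_geq_a = all(y >= x for x, y in zip(v_a, v_b))
    if a_geq_b and b_geq_a:
        return "=="
    if a_geq_b:
        return ">="
    if b_geq_a:
        return "<="
    return "incomparable"


# Hypothetical value functions over two states.
print(compare_policies([1.0, 2.0], [0.0, 2.0]))  # first is at least as good everywhere
print(compare_policies([1.0, 0.0], [0.0, 1.0]))  # each is better in one state
```

The second call shows why the order is only partial: when each policy wins in a different state, neither dominates the other.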
Optimal Policy
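We define an optimal policy to be any policy $\pi_*$ satisfying $\pi_* \geq \pi$ for all policies $\pi$, i.e. $v_{\pi_*}(s) \geq v_\pi(s)$ for every state $s$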
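For a very small problem, an optimal deterministic policy can be found by brute force: evaluate every deterministic policy and keep those whose values are at least as large as all others in every state. The sketch below does this for a hypothetical 2-state, 2-action deterministic MDP with discount factor 0.9; the tables `NEXT_STATE` and `REWARD` and all function names are assumptions for illustration.

```python
from itertools import product

GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2

# Hypothetical deterministic MDP: NEXT_STATE[s][a] and REWARD[s][a].
NEXT_STATE = [[0, 1], [1, 0]]
REWARD = [[1.0, 0.0], [2.0, 0.0]]


def policy_value(policy, sweeps=1000):
    """Approximate v_pi by repeatedly applying v(s) <- r(s, pi(s)) + gamma * v(s')."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [REWARD[s][policy[s]] + GAMMA * v[NEXT_STATE[s][policy[s]]]
             for s in range(N_STATES)]
    return v


def dominates(v_a, v_b, tol=1e-9):
    """True if v_a(s) >= v_b(s) in every state, up to a small numerical tolerance."""
    return all(x >= y - tol for x, y in zip(v_a, v_b))


# Enumerate all |A|^|S| deterministic policies and evaluate each one.
values = {pi: policy_value(pi) for pi in product(range(N_ACTIONS), repeat=N_STATES)}

# An optimal policy is one whose value dominates that of every policy.
optimal = [pi for pi, v in values.items()
           if all(dominates(v, other) for other in values.values())]
print(optimal, [values[pi] for pi in optimal])
```

Here the only optimal policy moves from state 0 to state 1 and then stays there, with values close to 18 and 20. The number of deterministic policies grows as $|\mathcal{A}|^{|\mathcal{S}|}$, so this enumeration is only feasible for toy problems.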