
Foundations of Machine Learning

DSA 5102 • Lecture 11

Li Qianxiao
Department of Mathematics
So far
We introduced two classes of machine learning problems
• Supervised Learning
• Unsupervised Learning

Today, we will look at another class of problems that lies somewhere in between, called reinforcement learning.
Motivation

Some General Observations of Learning
• Interactions with the environment
• Learning from experience
• Reward vs demonstrations
• Planning
The Reward Hypothesis
All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Examples
• Studying and getting good grades
• Learning to play a new musical instrument
• Winning at chess
• Navigating a maze
• An infant learning to walk
The Basic Components
[Diagram: the agent-environment interaction loop. The agent performs an action on the environment; an interpreter turns the environment's response into a state and a reward, which are passed back to the agent.]
Examples

Task                Agent    Environment    Interpreter    Reward
Chess               Player   Board state    Vision         Win/loss at the end
Learning to walk    Infant   The world      Senses         Not falling, getting to places
Navigating a maze   Player   The maze       Vision         Getting out of the maze
Key Differences in Reinforcement Learning

• Vs unsupervised learning: not completely unsupervised, due to the reward signal
• Vs supervised learning: not completely supervised, since the optimal actions to take are never given
Example: The Recycling Robot
Actions
• Search for cans
• Pick up or drop cans
• Stop and wait
• Go back and charge

Rewards:
• +10 for each can picked up
• -1 for each meter moved
• -1000 for running out of battery
The Reinforcement Learning Problem
The RL problem can be posed as follows:

An agent navigates an environment through the lens of an interpreter. It interacts with the environment by performing actions, and the environment in turn provides the agent with a reward signal. The agent's goal is to learn through experience how to maximize the long-term accumulated reward.
Finite Markov Decision Processes
Finite State, Discrete Time Markov Chains
• Sequence of time steps: $t = 0, 1, 2, \dots$
• State space: $\mathcal{S} = \{s_1, s_2, \dots, s_n\}$ such that $|\mathcal{S}| = n < \infty$
• States: $S_t \in \mathcal{S}$

The states $\{S_t\}$ form a stochastic process, which evolves according to a transition probability $\mathbb{P}(S_{t+1} = s' \mid S_t = s, S_{t-1}, \dots, S_0)$.
Markov Property and Time Homogeneity
Markov Property
$$\mathbb{P}(S_{t+1} = s' \mid S_t = s, S_{t-1}, \dots, S_0) = \mathbb{P}(S_{t+1} = s' \mid S_t = s)$$

Time-Homogeneous Markov Chain
• The transition probability is independent of time, i.e. $\mathbb{P}(S_{t+1} = s' \mid S_t = s) = p(s' \mid s)$ for all $t$
• The matrix $P$ with entries $P_{ij} = p(s_j \mid s_i)$ is called the transition (probability) matrix
Example
State space: $\mathcal{S} = \{s_1, s_2, s_3\}$
[Figure: a three-state transition diagram over $s_1, s_2, s_3$ with its transition probability matrix; see https://en.wikipedia.org/wiki/Markov_chain]
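As a minimal illustration, the sketch below simulates a time-homogeneous Markov chain from its transition matrix; the matrix entries are made up for illustration, not the ones on the slide's diagram.

```python
import numpy as np

# Illustrative 3-state transition matrix (each row sums to 1); the numbers
# are assumptions, not those from the slide's diagram.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.8, 0.1],
    [0.4, 0.4, 0.2],
])

def simulate(P, s0=0, T=10, seed=0):
    """Sample a trajectory S_0, S_1, ..., S_T of a time-homogeneous Markov chain."""
    rng = np.random.default_rng(seed)
    states = [s0]
    for _ in range(T):
        s = states[-1]
        # The next state is drawn from the row of P for the current state
        states.append(int(rng.choice(len(P), p=P[s])))
    return states

print(simulate(P))
```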
Non-Markovian or Non-Time-Homogeneous Stochastic Processes
Example of a non-Markovian process
• Drawing coins without replacement out of a bag consisting of 10 each of $1, 50c and 10c coins. Let $S_t$ be the total value of the coins drawn up to time $t$. The distribution of $S_{t+1}$ depends on which coins remain in the bag, not only on the current total $S_t$.

Example of a non-time-homogeneous process
• Drawing coins at time $t$, where the drawing rule changes with $t$
Essential Components of Markov Decision Processes
A Markov decision process (MDP) is a generalization of a Markov process, with actions and rewards.

Essential elements
• Sequence of time steps: $t = 0, 1, 2, \dots$
• States: $S_t \in \mathcal{S}$
• Actions: $A_t \in \mathcal{A}(S_t)$, with $\mathcal{A} = \bigcup_s \mathcal{A}(s)$ (union over all states $s$)
• Rewards: $R_t \in \mathcal{R} \subset \mathbb{R}$
State Evolution
[Diagram: at each step the agent observes state $S_t$ and reward $R_t$ and selects action $A_t$; the environment, seen through the interpreter, returns the next state $S_{t+1}$ and reward $R_{t+1}$.]
Transition Probability
For Markov chains, we have the transition probability $p(s' \mid s) = \mathbb{P}(S_{t+1} = s' \mid S_t = s)$.

For Markov decision processes, we additionally need to account for:
• The reward $R_{t+1}$
• The action $A_t$

Hence, we specify the MDP transition probability
$$p(s', r \mid s, a) = \mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$$
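To make the object $p(s', r \mid s, a)$ concrete, here is a minimal Python sketch that stores it as a lookup table and samples one transition; the state and action names (loosely inspired by the recycling robot) and all probabilities are assumptions for illustration only.

```python
import random

# Finite-MDP transition probability p(s', r | s, a) as a table:
#   (s, a)  ->  list of (s_next, reward, probability).
# All states, actions and numbers below are made up.
p = {
    ("high", "search"):   [("high", 1.0, 0.7), ("low", 1.0, 0.3)],
    ("low",  "search"):   [("low", 1.0, 0.6), ("dead", -1000.0, 0.4)],
    ("low",  "recharge"): [("high", 0.0, 1.0)],
}

def step(s, a, rng=random.Random(0)):
    """Sample (S_{t+1}, R_{t+1}) given S_t = s and A_t = a."""
    outcomes = p[(s, a)]
    weights = [prob for (_, _, prob) in outcomes]
    s_next, r, _ = rng.choices(outcomes, weights=weights, k=1)[0]
    return s_next, r

print(step("low", "search"))
```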


Markov Decision Processes
A Markov decision process (MDP) is the evolution of $(S_t, A_t, R_t)$ according to
$$\mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a) = p(s', r \mid s, a)$$

An MDP is finite if $\mathcal{S}$ is finite and $\mathcal{A}(s)$ is finite for each $s \in \mathcal{S}$.
Example: The Recycling Robot
State: (position, charge, weight)
Actions: the available actions depend on the current state, e.g.
• If the charge is zero, the action set is empty
• If the robot has charge remaining and its position has a can, it may pick the can up
• …
Reward: as specified earlier (+10 per can picked up, -1 per meter moved, -1000 for running out of battery)
[Figure: a grid world showing the robot, the cans, and the charging station]
The “Decision” Aspect: The Policy
The only way the agent has control over this system is through the choice of actions.

This is done by specifying a policy
$$\pi(a \mid s) = \mathbb{P}(A_t = a \mid S_t = s)$$

Deterministic policies: $\pi(a \mid s) \in \{0, 1\}$ for all $(s, a)$. Then we write $a = \pi(s)$, i.e. deterministic policies are functions from states to actions.
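A short sketch of both kinds of policy: a stochastic policy as a table of action probabilities, and a deterministic policy as a plain function. The state and action names and the probabilities are made-up placeholders.

```python
import random

# Stochastic policy pi(a | s) as a table; deterministic policy as a function.
# Names and numbers are illustrative assumptions only.
pi_table = {
    "high": {"search": 0.9, "wait": 0.1},
    "low":  {"recharge": 0.8, "search": 0.2},
}

def sample_action(s, rng=random.Random(0)):
    """Draw A_t ~ pi(. | S_t = s) for the stochastic policy above."""
    actions, probs = zip(*pi_table[s].items())
    return rng.choices(actions, weights=probs, k=1)[0]

def pi_deterministic(s):
    """A deterministic policy is just a function from states to actions."""
    return "search" if s == "high" else "recharge"

print(sample_action("low"), pi_deterministic("low"))
```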
The Goal of Choosing a Policy: Returns
We want to maximize long-term rewards…

Define the return
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Here, $\gamma \in [0, 1]$ is the discount rate.

This formulation includes both finite and infinite time MDPs (for a finite horizon, the rewards are zero after the terminal time).
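The return can be computed backwards using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$; a minimal sketch, with a made-up reward sequence:

```python
# Compute G_0 = sum_k gamma^k R_{k+1} for a finite reward sequence,
# using the backward recursion G_t = R_{t+1} + gamma * G_{t+1}.
# The rewards below are made up for illustration.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, -1.0, 10.0], gamma=0.9))
```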


The Objective of RL
The goal of RL is to maximize, by choosing a good policy $\pi$, the expected return
$$\max_{\pi} \; \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right],$$
where we start from some state $s$.

We will consider time-homogeneous cases, where this is the same as maximizing $\mathbb{E}_{\pi}\left[ G_0 \mid S_0 = s \right]$.
Dynamic Programming
Example
[Figure: a small directed graph with a reward on each edge; the task is to find a path from the start node to the goal node with the largest total reward.]

How long does it take to check all possibilities?
The Curse of Dimensionality
A term coined by R. Bellman (1957)

The number of states grows exponentially when the dimensionality of the problem increases.

Can we have a non-brute-force algorithm?


Dynamic Programming Principle
On an optimal path (following the optimal policy), if we start at any state on that path, the rest of the path must again be optimal.
Dynamic Programming in Action
Define the value of a state as the best total reward achievable starting from that state; it can be computed backwards from the goal node.
[Figure: the same graph as before, with the value of each node filled in by the backward recursion.]
The Complexity of Dynamic Programming

We have shown that brute-force search takes a number of steps that grows exponentially with the number of stages.

What about dynamic programming?
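A minimal sketch of the idea on a small graph (the graph below is made up, not the one on the slides): each node's value is computed once and cached, so the cost is proportional to the number of edges rather than the number of paths.

```python
from functools import lru_cache

# Dynamic programming on a small graph with a reward on each edge.
# V(s) = max over edges (s -> s') of [ reward(s, s') + V(s') ],  V(goal) = 0.
edges = {
    "A": [("B", 2.0), ("C", -1.0)],
    "B": [("D", 3.0), ("C", 1.0)],
    "C": [("D", 4.0)],
    "D": [],  # goal / terminal node
}

@lru_cache(maxsize=None)
def V(s):
    if not edges[s]:
        return 0.0
    # Each subproblem is solved once and reused (memoization)
    return max(r + V(s_next) for s_next, r in edges[s])

print(V("A"))  # best achievable total reward from the start node A
```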


Summary of Key Ideas

1. Come up with a measure of “value” of each state
2. Come up with a recursive way to compute the value
3. Find the optimal policy by acting greedily according to the value
Bellman’s Equations and Optimal Policies

Value Function
As motivated earlier, we define the value function
$$v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$$
and the action-value function
$$q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$

Our goal: derive a recursion for $v_\pi$ and $q_\pi$.

These are known as Bellman’s equations.
Relationship between $v_\pi$ and $q_\pi$
Using the definitions, we can show the following relationships:
$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma\, v_\pi(s') \right]$$

Combining, we get
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma\, v_\pi(s') \right]$$

This is known as Bellman’s equation for the value function.
Bellman’s Equation
For finite MDPs, Bellman’s equation can be written in matrix form as
$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$
where $[P_\pi]_{s s'} = \sum_a \pi(a \mid s) \sum_r p(s', r \mid s, a)$ and $[r_\pi]_s = \sum_a \pi(a \mid s) \sum_{s', r} r\, p(s', r \mid s, a)$.

This is a linear equation, and we can show that there exists a unique solution for $v_\pi$.
In fact, it is just
$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi,$$
whose existence and uniqueness follow from the invertibility of $I - \gamma P_\pi$, which in turn follows from $\gamma < 1$ (the spectral radius of $\gamma P_\pi$ is at most $\gamma$).
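Since the equation is linear, $v_\pi$ can be obtained with one linear solve. A minimal sketch, where $P_\pi$ and $r_\pi$ are made-up quantities already averaged over the policy:

```python
import numpy as np

# Solve Bellman's equation  v = r + gamma * P v,  i.e.  v = (I - gamma P)^{-1} r.
# P_pi and r_pi below are illustrative assumptions.
gamma = 0.9
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])   # P_pi[s, s'] = prob of moving s -> s' under pi
r_pi = np.array([1.0, -0.5])    # expected one-step reward in each state

v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)  # unique solution, since gamma < 1 makes I - gamma * P_pi invertible
```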
Bellman’s Equation for the Action-Value Function
Using similar methods, one can show that the action-value function satisfies a similar recursion
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \right]$$

Exercise: derive this equation and show that there exists a unique solution.
Comparing Policies
We can compare policies via their values
• Given policies $\pi, \pi'$, we say $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s$
• This is a partial order

Examples
• If $v_\pi(s) \geq v_{\pi'}(s)$ at every state, then $\pi \geq \pi'$
• If $v_\pi(s_1) > v_{\pi'}(s_1)$ but $v_\pi(s_2) < v_{\pi'}(s_2)$, then neither $\pi \geq \pi'$ nor $\pi' \geq \pi$ holds
Optimal Policy
We define an optimal policy $\pi_*$ to be any policy satisfying $\pi_* \geq \pi$ for every policy $\pi$.

In other words, $v_{\pi_*}(s) \geq v_\pi(s)$ for all $s$ and all $\pi$.

• Does such a $\pi_*$ exist?
• Is it unique?
Policy Improvement
We can derive the following result:

For any two policies $\pi, \pi'$, if
$$\sum_a \pi'(a \mid s)\, q_\pi(s, a) \geq v_\pi(s) \quad \text{for all } s,$$
then we must have
$$v_{\pi'}(s) \geq v_\pi(s) \quad \text{for all } s.$$

In addition, if the first inequality is strict for some $s$, then the second inequality is strict for at least one $s$.
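In particular, acting greedily with respect to $q_\pi$ satisfies the hypothesis, so the greedy policy is at least as good as $\pi$. A minimal sketch of this greedy step, with made-up numbers for the table $q_\pi$:

```python
import numpy as np

# One policy-improvement step: given q_pi(s, a) as a table, act greedily,
# i.e. pi'(s) = argmax_a q_pi(s, a).  The numbers below are illustrative.
actions = ["search", "wait"]
q_pi = {
    "high": np.array([5.0, 3.0]),   # q_pi(high, search), q_pi(high, wait)
    "low":  np.array([1.0, 2.5]),
}

pi_new = {s: actions[int(np.argmax(q))] for s, q in q_pi.items()}
print(pi_new)  # by the result above, the greedy policy satisfies v_{pi'} >= v_pi
```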
Bellman’s Optimality Condition
A policy $\pi$ is optimal if and only if for any state-action pair $(s, a)$ such that $\pi(a \mid s) > 0$, we have
$$q_\pi(s, a) = \max_{a'} q_\pi(s, a').$$

This means that an optimal policy must choose actions that maximize its associated action-value function.

This then implies the existence of an optimal policy!
Bellman’s Optimality Equation
Corresponding to an optimal policy $\pi_*$, write $v_* = v_{\pi_*}$ and $q_* = q_{\pi_*}$.

We obtain the following recursions
$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma\, v_*(s') \right]$$
$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma \max_{a'} q_*(s', a') \right]$$

These are known as Bellman’s optimality equations.
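Algorithms for solving these equations are the topic of the next lecture; as a preview, here is a minimal value-iteration-style sketch that simply iterates the optimality recursion for $v_*$ on a made-up two-state, two-action MDP.

```python
import numpy as np

# Iterate v(s) <- max_a sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * v(s')).
# P[a, s, s'] is the transition probability, R[a, s, s'] the reward;
# all numbers below are illustrative assumptions.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, -1.0]]])

v = np.zeros(2)
for _ in range(1000):
    q = (P * (R + gamma * v)).sum(axis=2)   # q[a, s]
    v_new = q.max(axis=0)                   # maximize over actions
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print(v, q.argmax(axis=0))  # approximate v_* and a greedy policy
```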


Some Remarks
The optimal value function $v_*$ is unique. Is an optimal policy unique?

Observe that the policy generated from $q_*$ can be taken to be deterministic, e.g. $\pi_*(s) \in \arg\max_a q_*(s, a)$.

In fact, for every policy $\pi$ there exists a deterministic policy $\pi'$ such that $\pi' \geq \pi$!
Summary
We introduced
• Basic formulation of reinforcement learning
• MDP as the mathematical framework
• Bellman’s equations characterizing optimal policies

Next time: algorithms to solve RL problems
