Unit 03 RL Problem
Syllabus: The Agent–Environment Interface, Goals and Rewards, Returns, Unified Notation for
Episodic and Continuing Tasks, Value Functions, Optimal Value Functions, Optimality and
Approximation
More specifically, the agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, . . .. At each time step t, the agent receives some representation of the environment's state, St ∈ S, and on that basis selects an action, At ∈ A(s). One time step later, in part as a consequence of its action, the agent receives a numerical reward, Rt+1 ∈ R ⊂ ℝ, and finds itself in a new state, St+1. The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:

S0, A0, R1, S1, A1, R2, S2, A2, R3, . . .
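As a concrete illustration, here is a minimal Python sketch of this interaction loop. The two-state environment and the uniformly random choice of actions are hypothetical stand-ins (not from the text); the point is only the ordering S0, A0, R1, S1, A1, R2, ... of the resulting trajectory.

import random

# Minimal sketch of the agent-environment loop for a hypothetical two-state
# environment with actions 0 and 1 (all details invented for illustration).
class ToyEnv:
    def __init__(self):
        self.state = 0                       # S_0

    def step(self, action):
        # The environment produces the next state S_{t+1} and reward R_{t+1}.
        self.state = (self.state + action) % 2
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

env = ToyEnv()
state = env.state
trajectory = [state]                         # S_0
for t in range(3):                           # t = 0, 1, 2
    action = random.choice([0, 1])           # A_t, chosen by the agent
    state, reward = env.step(action)         # R_{t+1}, S_{t+1}
    trajectory += [action, reward, state]
print(trajectory)                            # S_0, A_0, R_1, S_1, A_1, R_2, ...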
In a finite MDP, the sets of states, actions, and rewards (S, A, and R) all have a finite number of elements. In this case, the random variables Rt and St have well-defined discrete probability distributions dependent only on the preceding state and action. That is, there is a probability of those values occurring at time t, given particular values of the preceding state and action:

p(s′, r | s, a) ≐ Pr{St = s′, Rt = r | St−1 = s, At−1 = a}.
The function p defines the dynamics of the MDP. The dot over the equals sign in the equation reminds us that it is a definition (in this case of the function p). The dynamics function p : S × R × S × A → [0, 1] is an ordinary deterministic function of four arguments. The ‘|’ in the middle of it comes from the notation for conditional probability, but here it just reminds us that p specifies a probability distribution for each choice of s and a, that is, that

Σ_{s′∈S} Σ_{r∈R} p(s′, r | s, a) = 1, for all s ∈ S, a ∈ A(s).
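To make this concrete, the sketch below stores the four-argument dynamics of a small hypothetical two-state MDP as a lookup table and checks the normalization property above. The states, actions, rewards, and probabilities are invented purely for illustration.

from collections import defaultdict

# Hypothetical dynamics table: p[(s_next, r, s, a)] = p(s', r | s, a).
p = {
    (0, 0.0, 0, 0): 1.0,
    (1, 1.0, 0, 1): 0.7, (0, 0.0, 0, 1): 0.3,
    (1, 1.0, 1, 0): 1.0,
    (0, 0.0, 1, 1): 1.0,
}

# For every (s, a), the probabilities over all (s', r) pairs must sum to one.
totals = defaultdict(float)
for (s_next, r, s, a), prob in p.items():
    totals[(s, a)] += prob
assert all(abs(total - 1.0) < 1e-9 for total in totals.values())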
In a Markov decision process, the probabilities given by p completely characterize the environment's dynamics. That is, the probability of each possible value for St and Rt depends on the immediately preceding state and action, St−1 and At−1, and, given them, not at all on earlier states and actions.
The state must include information about all aspects of the past agent–environment interaction that make a difference for the future. If it does, then the state is said to have the Markov property.
From the four-argument dynamics function, p, one can compute anything else one might want to know about the environment, such as the state-transition probabilities (which we denote, with a slight abuse of notation, as a three-argument function p : S × S × A → [0, 1]):

p(s′ | s, a) ≐ Pr{St = s′ | St−1 = s, At−1 = a} = Σ_{r∈R} p(s′, r | s, a).
We can also compute the expected rewards for state–action pairs as a two-argument function r : S × A → ℝ:

r(s, a) ≐ E[Rt | St−1 = s, At−1 = a] = Σ_{r∈R} r Σ_{s′∈S} p(s′, r | s, a),
and the expected rewards for state–action–next-state triples as a three-argument function r : S × A × S → ℝ:

r(s, a, s′) ≐ E[Rt | St−1 = s, At−1 = a, St = s′] = Σ_{r∈R} r p(s′, r | s, a) / p(s′ | s, a).
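The sketch below computes these derived quantities from the same hypothetical four-argument table used above; the numbers are again invented for illustration.

# Hypothetical dynamics table: p[(s_next, r, s, a)] = p(s', r | s, a).
p = {
    (0, 0.0, 0, 0): 1.0,
    (1, 1.0, 0, 1): 0.7, (0, 0.0, 0, 1): 0.3,
    (1, 1.0, 1, 0): 1.0,
    (0, 0.0, 1, 1): 1.0,
}

def trans_prob(s_next, s, a):
    """p(s' | s, a): sum over r of p(s', r | s, a)."""
    return sum(prob for (sn, r, ss, aa), prob in p.items()
               if sn == s_next and ss == s and aa == a)

def expected_reward(s, a):
    """r(s, a): sum over s', r of r * p(s', r | s, a)."""
    return sum(r * prob for (sn, r, ss, aa), prob in p.items()
               if ss == s and aa == a)

def expected_reward_triple(s, a, s_next):
    """r(s, a, s'): sum over r of r * p(s', r | s, a), divided by p(s' | s, a)."""
    return sum(r * prob for (sn, r, ss, aa), prob in p.items()
               if sn == s_next and ss == s and aa == a) / trans_prob(s_next, s, a)

print(trans_prob(1, 0, 1))              # 0.7
print(expected_reward(0, 1))            # 0.7
print(expected_reward_triple(0, 1, 1))  # 1.0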
The MDP framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision making and acting. Actions can be any decisions we want to learn how to make, and states can be anything we can know that might be useful in making them.
In particular, the boundary between agent and environment is typically not the same as the physical boundary of a robot's or an animal's body. Usually, the boundary is drawn closer to the agent than that.
The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. We do not assume that everything in the environment is unknown to the agent; in some cases the agent may know everything about how its environment works and still face a difficult reinforcement learning task, just as we may know exactly how a puzzle like Rubik's cube works but still be unable to solve it. The agent–environment boundary represents the limit of the agent's absolute control, not of its knowledge.
The agent–environment boundary can be located at different places for different purposes. In a complicated robot, many different agents may be operating at once, each with its own boundary.
The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. It proposes that whatever the details of the sensory, memory, and control apparatus, and whatever objective one is trying to achieve, any problem of learning goal-directed behavior can be reduced to three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent's goal (the rewards). This framework may not be sufficient to represent all decision-learning problems usefully, but it has proved to be widely useful and applicable.
Grid world Example:
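The grid world figure itself is not reproduced in these notes, so the following sketch stands in for it with assumed details: a 4x4 grid in which every move earns a reward of −1 and the episode ends in an assumed terminal corner cell. Only the structure (states, actions, rewards, termination) matters here.

# Hypothetical 4x4 grid world: states are (row, col), every move costs -1,
# moves off the grid leave the state unchanged, episode ends at TERMINAL.
ROWS, COLS = 4, 4
TERMINAL = (0, 0)
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def step(state, action):
    """One environment transition: returns (next_state, reward, done)."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), ROWS - 1)
    col = min(max(state[1] + dc, 0), COLS - 1)
    next_state = (row, col)
    return next_state, -1.0, next_state == TERMINAL

state, done, total_reward = (3, 3), False, 0.0
while not done:
    action = 'up' if state[0] > 0 else 'left'   # a fixed, hand-written policy
    state, reward, done = step(state, action)
    total_reward += reward
print(total_reward)   # -6.0: three moves up, then three moves left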
In the episodic case, the return Gt is defined as the sum of the rewards received after time step t,

Gt ≐ Rt+1 + Rt+2 + Rt+3 + · · · + RT,        (3.7)

where T is a final time step. This approach makes sense in applications in which there is a natural notion of final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we call episodes. Each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states. Even if you think of episodes as ending in different ways, such as winning and losing a game, the next episode begins independently of how
the previous one ended. Thus, the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes. Tasks with episodes of this kind are called episodic tasks.
On the other hand, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit.
We call these continuing tasks. The return formulation (3.7) is problematic for continuing tasks because the final time step would be T = ∞, and the return, which is what we are trying to maximize, could easily be infinite.
The additional concept that we need is that of discounting. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses At to maximize the expected discounted return:

Gt ≐ Rt+1 + γRt+2 + γ^2 Rt+3 + · · · = Σ_{k=0}^{∞} γ^k Rt+k+1,        (3.8)

where γ is a parameter, 0 ≤ γ ≤ 1, called the discount rate.
Note that although the return (3.8) is a sum of an infinite number of terms, it is still finite if the reward is nonzero and constant, provided γ < 1. For example, if the reward is a constant +1, then the return is

Gt = Σ_{k=0}^{∞} γ^k = 1 / (1 − γ).
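A quick numerical check of this claim, for an assumed discount rate of γ = 0.9: the truncated discounted sum of a constant +1 reward matches the closed form 1/(1 − γ).

# Discounted return of a constant +1 reward, truncated after many terms,
# compared with the closed-form value 1 / (1 - gamma).
gamma = 0.9
partial_sum = sum(gamma ** k * 1.0 for k in range(1000))
closed_form = 1.0 / (1.0 - gamma)
print(partial_sum, closed_form)   # both approximately 10.0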
We number the time steps of each episode starting anew from zero. Therefore, we have to refer not just to St, the state representation at time t, but to St,i, the state representation at time t of episode i.
We are almost always considering a particular episode, or stating something that is true for all episodes. Accordingly, in practice we almost always abuse notation slightly by dropping the explicit reference to episode number. That is, we write St to refer to St,i, and so on.
We need one other convention to obtain a single notation that covers both episodic and continuing tasks. We have defined the return as a sum over a finite number of terms in one case (3.7) and as a sum over an infinite number of terms in the other (3.8). These two can be unified by considering episode termination to be the entering of a special absorbing state that transitions only to itself and that generates only rewards of zero. For example, consider the state transition diagram:
Here the solid square represents the special absorbing state corresponding to the end of an episode. Starting from S0, we get the reward sequence +1, +1, +1, 0, 0, 0, . . .. Summing these, we get the same return whether we sum over the first T rewards (here T = 3) or over the full infinite sequence. This remains true even if we introduce discounting.
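The short sketch below verifies this for the reward sequence above, using an assumed discount rate: the discounted return over the first T = 3 rewards equals the return over the zero-padded "infinite" sequence.

# Reward sequence +1, +1, +1 followed by the absorbing state's zero rewards.
gamma = 0.9
episode_rewards = [1.0, 1.0, 1.0]
padded_rewards = episode_rewards + [0.0] * 1000   # absorbing state: reward 0
g_episode = sum(gamma ** k * r for k, r in enumerate(episode_rewards))
g_padded = sum(gamma ** k * r for k, r in enumerate(padded_rewards))
print(g_episode, g_padded)   # identical: 1 + 0.9 + 0.81 = 2.71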
Thus, we can define the return, in general, according to (3.8), using the convention of omitting episode numbers when they are not needed, and including the possibility that γ = 1 if the sum remains defined (e.g., because all episodes terminate). Alternatively, we can write

Gt ≐ Σ_{k=t+1}^{T} γ^{k−t−1} Rk,

including the possibility that T = ∞ or γ = 1 (but not both). We use these conventions throughout the rest of the unit to simplify notation and to express the close parallels between episodic and continuing tasks.
A policy π is a mapping from states to probabilities of selecting each possible action; π(a|s) denotes the probability of taking action a in state s. The value of a state s under a policy π, denoted vπ(s), is the expected return when starting in s and following π thereafter:

vπ(s) ≐ Eπ[Gt | St = s] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s ], for all s ∈ S,

where Eπ[·] denotes the expected value of a random variable given that the agent follows policy π, and t is any time step. Note that the value of the terminal state, if any, is always zero. We call the function vπ the state-value function for policy π.
Similarly, we define the value of taking action a in state s under a policy π, denoted qπ(s, a), as the expected return starting from s, taking the action a, and thereafter following policy π:

qπ(s, a) ≐ Eπ[Gt | St = s, At = a] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s, At = a ].

We call qπ the action-value function for policy π.
A fundamental property of vπ is that it satisfies the recursive relationship

vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ], for all s ∈ S.        (3.14)

Equation (3.14) is the Bellman equation for vπ. It expresses a relationship between the value of a state and the values of its successor states.
The Bellman equation averages over all the possibilities, weighting each by its probability of occurring. It states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way.
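One way to see the Bellman equation at work is to apply it repeatedly as an update rule; this is iterative policy evaluation. The sketch below does so for the same hypothetical two-state MDP used earlier, under an assumed equiprobable policy; both the dynamics and the policy are illustrative choices, not taken from the text.

# Iterative policy evaluation: repeatedly apply the Bellman equation for v_pi.
p = {                                   # hypothetical p[(s', r, s, a)]
    (0, 0.0, 0, 0): 1.0,
    (1, 1.0, 0, 1): 0.7, (0, 0.0, 0, 1): 0.3,
    (1, 1.0, 1, 0): 1.0,
    (0, 0.0, 1, 1): 1.0,
}
states, actions, gamma = [0, 1], [0, 1], 0.9
pi = {(s, a): 0.5 for s in states for a in actions}   # equiprobable policy

v = {s: 0.0 for s in states}
for _ in range(1000):
    # v(s) <- sum over a of pi(a|s) * sum over s', r of p(s',r|s,a)(r + gamma v(s'))
    v = {s: sum(pi[(s, a)] * prob * (r + gamma * v[s_next])
                for (s_next, r, ss, a), prob in p.items() if ss == s)
         for s in states}
print(v)   # approximate v_pi for each state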
A policy π is defined to be better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states: π ≥ π′ if and only if vπ(s) ≥ vπ′(s) for all s ∈ S.
There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by π*. They share the same state-value function, called the optimal state-value function, denoted v*, and defined as

v*(s) ≐ max_π vπ(s), for all s ∈ S.
Optimal policies also share the same optimal action-value function, denoted q*, and defined as

q*(s, a) ≐ max_π qπ(s, a),

for all s ∈ S and a ∈ A(s). For the state–action pair (s, a), this function gives the expected return for taking action a in state s and thereafter following an optimal policy. Thus, we can write q* in terms of v* as follows:

q*(s, a) = E[ Rt+1 + γ v*(St+1) | St = s, At = a ].
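The sketch below evaluates this one-step lookahead for the same hypothetical MDP, with assumed (made-up) values for v*; it only illustrates the form of the computation, not actual optimal values.

# q*(s, a) as a one-step lookahead on v*:
# q*(s, a) = sum over s', r of p(s', r | s, a) * (r + gamma * v*(s')).
gamma = 0.9
p = {                                   # hypothetical p[(s', r, s, a)]
    (0, 0.0, 0, 0): 1.0,
    (1, 1.0, 0, 1): 0.7, (0, 0.0, 0, 1): 0.3,
    (1, 1.0, 1, 0): 1.0,
    (0, 0.0, 1, 1): 1.0,
}
v_star = {0: 8.5, 1: 9.0}               # assumed values, for illustration only

def q_star(s, a):
    return sum(prob * (r + gamma * v_star[s_next])
               for (s_next, r, ss, aa), prob in p.items()
               if ss == s and aa == a)

print(q_star(0, 1))   # 0.7*(1 + 0.9*9.0) + 0.3*(0.9*8.5) = 8.665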
Because v* is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation for state values. Because it is the optimal value function, however, v*'s consistency condition can be written in a special form without reference to any specific policy. This is the Bellman equation for v*, or the Bellman optimality equation. Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

v*(s) = max_a E[ Rt+1 + γ v*(St+1) | St = s, At = a ]
      = max_a Σ_{s′,r} p(s′, r | s, a) [ r + γ v*(s′) ].
The last two equations are two forms of the Bellman optimality equation for v*. The Bellman optimality equation for q* is

q*(s, a) = E[ Rt+1 + γ max_{a′} q*(St+1, a′) | St = s, At = a ]
         = Σ_{s′,r} p(s′, r | s, a) [ r + γ max_{a′} q*(s′, a′) ].
Backup diagrams show graphically the spans of future states and actions considered in the Bellman optimality equations for v* and q*: the backup diagram for v* graphically represents the Bellman optimality equation for v*, and the backup diagram for q* graphically represents the Bellman optimality equation for q*.
The Bellman optimality equation is actually a system of equations, one for each state, so if there are n states, then there are n equations in n unknowns. If the dynamics p of the environment are known, then in principle one can solve this system of equations for v* using any one of a variety of methods for solving systems of nonlinear equations.
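One such method is value iteration, which turns the Bellman optimality equation into an update rule and sweeps it until the values stop changing. The sketch below applies it to the same hypothetical two-state MDP used in the earlier sketches; the dynamics are illustrative, not from the text.

# Value iteration: v(s) <- max over a of sum over s', r of p(s',r|s,a)(r + gamma v(s')).
p = {                                   # hypothetical p[(s', r, s, a)]
    (0, 0.0, 0, 0): 1.0,
    (1, 1.0, 0, 1): 0.7, (0, 0.0, 0, 1): 0.3,
    (1, 1.0, 1, 0): 1.0,
    (0, 0.0, 1, 1): 1.0,
}
states, actions, gamma = [0, 1], [0, 1], 0.9

def backup(s, a, v):
    """Expected one-step return: sum over s', r of p(s',r|s,a)(r + gamma v(s'))."""
    return sum(prob * (r + gamma * v[s_next])
               for (s_next, r, ss, aa), prob in p.items()
               if ss == s and aa == a)

v = {s: 0.0 for s in states}
while True:
    new_v = {s: max(backup(s, a, v) for a in actions) for s in states}
    delta = max(abs(new_v[s] - v[s]) for s in states)
    v = new_v
    if delta < 1e-10:
        break
print(v)   # approximate v* for each state (about 9.59 and 10.0 here)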
Once one has v*, it is relatively easy to determine an optimal policy. For each state s, there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy. Another way of saying this is that any policy that is greedy with respect to the optimal evaluation function v* is an optimal policy.
Having q* makes choosing optimal actions even easier. With q*, the agent does not even have to do a one-step-ahead search: for any state s, it can simply find any action that maximizes q*(s, a). The action-value function effectively caches the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state–action pair. Hence, at the cost of representing a function of state–action pairs, instead of just of states, the optimal action-value function allows optimal actions to be selected without having to know anything about possible successor states and their values, that is, without having to know anything about the environment's dynamics.
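As a final sketch, assuming a table of q* values is already available (the numbers below are made up), choosing an optimal action is just an argmax over the actions in the current state, with no reference to the dynamics p.

# Greedy action selection from an assumed table of optimal action values q*.
q_star = {
    (0, 0): 7.8, (0, 1): 8.7,    # q*(s=0, a=0), q*(s=0, a=1)
    (1, 0): 9.1, (1, 1): 7.9,    # q*(s=1, a=0), q*(s=1, a=1)
}
actions = [0, 1]

def act(state):
    """Pick any action that maximizes q*(state, a)."""
    return max(actions, key=lambda a: q_star[(state, a)])

print(act(0), act(1))   # 1 0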