Chapter 2
• Methods exist to find the optimal policy without a model, by querying the
environment. These methods are named model-free methods.
• Value-based model-free methods are the most basic learning approach of
reinforcement learning.
• They work well in problems with deterministic environments and discrete action
spaces, such as mazes and games.
• Model-free learning makes few demands on the environment, building up the
policy function 𝜋(𝑠) → 𝑎 by sampling the environment.
Grid Worlds
• In a grid world the agent can find the best action for each state.
State Representation
Transition 𝑻𝒂
• The transition function 𝑇 𝑎 determines how the state changes after an action has
been selected.
• In model-free reinforcement learning the transition function is implicit in the
solution algorithm: the environment has access to the transition function and
uses it to compute the next state 𝑠′, but the agent does not.
• In model-based reinforcement learning, the agent has its own transition
function, an approximation of the environment’s transition function, which is
learned from the environment feedback.
• The dynamics of the MDP are modelled by transition function 𝑇𝑎 (·) and reward
function 𝑅𝑎 (·).
• The imaginary space of all possible states is called the state space.
• The two functions define a two-step transition from state 𝑠 to 𝑠′, via action
𝑎: 𝑠 → 𝑎 → 𝑠′ (a small grid-world sketch of these two functions follows after this list).
• A graph such as the one in the center and right panels of Fig. 2.5, in which child
nodes have only one parent node and there are no cycles, is known as a tree.
• The root node at the top is state 𝑠.
• As actions are performed and states and rewards are returned back up the tree, a
learning process takes place in the agent.
• In the tree of Fig. 2.5 action selection moves downward, towards the leaves. At
the deeper states we find the rewards, which we propagate upwards to the parent
states.
• Reward learning is learning by backpropagation.
• Action selection moves down, reward learning flows up.
• Reinforcement learning is learning by trial and error.
• Trial is selecting an action down (using the behavior policy) to perform in the
environment.
• Error is receiving a feedback reward from the environment and reporting it back
up the tree to the state, to update the current behavior policy.
• The downward selection policy chooses which actions to explore, and the
upward propagation of the error signal performs the learning of the policy.
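• A minimal sketch of what the environment-side transition function 𝑇𝑎 and reward function 𝑅𝑎 might look like for a small grid world (the 3×4 layout, the goal cell, and the function names are illustrative assumptions, not taken from the book):

    # Hypothetical 3x4 grid world: states are (row, col) cells, the goal is at (0, 3).
    ROWS, COLS = 3, 4
    GOAL = (0, 3)
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def transition(state, action):
        # Environment-side transition function T_a: computes the next state s'.
        row, col = state
        d_row, d_col = ACTIONS[action]
        new_row, new_col = row + d_row, col + d_col
        if 0 <= new_row < ROWS and 0 <= new_col < COLS:
            return (new_row, new_col)
        return state  # bumping into a wall keeps the agent in place

    def reward(state, action, next_state):
        # Environment-side reward function R_a for the transition s -> a -> s'.
        return 1.0 if next_state == GOAL else 0.0

    # The two-step transition s -> a -> s' from the text:
    s = (2, 0)
    a = "up"
    s_next = transition(s, a)
    r = reward(s, a, s_next)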
Reward 𝑹𝒂
Discount Factor 𝜸
• We distinguish between two types of tasks: (1) continuous, long-running tasks,
and (2) episodic tasks, which end.
• In continuous, long-running tasks it makes sense to discount rewards that lie far
in the future, in order to value current information at the present time more
strongly. To achieve this, a discount factor 𝛾 is used in our MDP that reduces the
impact of faraway rewards.
• Many continuous tasks use discounting, 𝛾 ≠ 1.
• Both the supermarket example and the game of chess are episodic, and
discounting does not make sense in these problems, 𝛾 = 1.
Policy 𝝅
• The policy 𝜋 is a conditional probability distribution that for each possible state
specifies the probability of each possible action.
• The function 𝜋 is a mapping from the state space to a probability distribution
over the action space:
𝜋 : 𝑆 → 𝑝(𝐴)
where 𝑝(𝐴) can be a discrete or continuous probability distribution.
• For a particular probability (density) from this distribution we write 𝜋(𝑎|𝑠).
• A special case of a policy is a deterministic policy, denoted by 𝜋(𝑠), where 𝜋 : 𝑆 → 𝐴.
• A deterministic policy selects a single action in every state.
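• A small sketch of the difference between a stochastic policy 𝜋(𝑎|𝑠) and a deterministic policy 𝜋(𝑠); the states, actions, and probabilities below are made up for illustration:

    import random

    # A stochastic policy maps each state to a probability distribution over actions.
    stochastic_policy = {
        "s0": {"left": 0.2, "right": 0.8},
        "s1": {"left": 0.5, "right": 0.5},
    }

    def sample_action(pi, state):
        actions, probs = zip(*pi[state].items())
        return random.choices(actions, weights=probs, k=1)[0]

    # A deterministic policy maps each state to exactly one action.
    deterministic_policy = {"s0": "right", "s1": "left"}

    a = sample_action(stochastic_policy, "s0")  # "right" with probability 0.8
    a_det = deterministic_policy["s0"]          # always "right"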
Trace 𝝉
• Note that if we used an infinite-horizon return (Eq. 2.2) with 𝛾 = 1.0, the
cumulative reward could become unbounded. Therefore, in continuous problems,
we use a discount factor close to 1.0, such as 𝛾 = 0.99.
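• A short sketch of how the discounted return of a trace is computed; with 𝛾 < 1 rewards further in the future contribute less, which keeps the sum bounded (the reward values are made up):

    # Discounted return of a trace: G = sum_t gamma^t * r_t
    def discounted_return(rewards, gamma=0.99):
        g = 0.0
        for t, r in enumerate(rewards):
            g += (gamma ** t) * r
        return g

    trace_rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
    print(discounted_return(trace_rewards, gamma=0.99))  # later rewards count less
    print(discounted_return(trace_rewards, gamma=1.0))   # undiscounted, episodic case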
State Value 𝑽
• The environment can be stochastic, and so can our policy, and for a given policy
we do not always get the same trace.
• We are actually interested in the expected cumulative reward that a certain
policy achieves.
• The expected cumulative discounted future reward of a state is better known as
the value of that state.
• Every policy 𝜋 has one unique associated value function 𝑉^𝜋 (𝑠). We often omit 𝜋
to simplify notation, simply writing 𝑉 (𝑠), knowing a state value is always
conditioned on a certain policy.
• The state value is defined for every possible state 𝑠 ∈ 𝑆.
• Finally, the state value of a terminal state is by definition zero.
State-Action Value 𝑸
• Every policy 𝜋 has only one unique associated state-action value function 𝑄^𝜋 (𝑠, 𝑎).
• We often omit 𝜋 to simplify notation.
• The state-action value of a terminal state is by definition zero.
• Note the recursion on the value function, and that for the Bellman equation the
transition and reward functions must be known to the agent for all states.
• Together, the transition and reward model are referred to as the dynamics model
of the environment.
• The dynamics model is often not known by the agent, and model-free methods
have been developed to compute the value function and policy function without
them.
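• A minimal sketch of a single Bellman backup for 𝑄(𝑠, 𝑎) when the dynamics model is known to the agent (the expected-reward form and the dictionary representation of T and R are assumptions made for illustration):

    # One Bellman backup: Q(s, a) = R_a(s) + gamma * sum_s' T_a(s, s') * V(s').
    # T[s][a] is a list of (probability, next_state) pairs; R[s][a] is the expected reward.
    GAMMA = 1.0

    def bellman_q_backup(s, a, T, R, V):
        return R[s][a] + GAMMA * sum(p * V[s_next] for p, s_next in T[s][a])

    # Tiny example with one non-terminal state; the terminal state has value 0 by definition.
    T = {"start": {"go": [(1.0, "end")]}}
    R = {"start": {"go": 1.0}}
    V = {"end": 0.0}
    print(bellman_q_backup("start", "go", T, R, V))  # 1.0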
Model-Free Learning
• Frequently, we are in a situation in which the transition probabilities are not known
to the agent, and we need other methods to compute the policy function. For
this situation, model-free algorithms have been developed.
• We will see how, when the agent does not know the transition function, an
optimal policy can be learned by sampling rewards from the environment.
• These algorithms are based on a few principles. First, we will discuss how the
principle of sampling can be used to construct a value function.
• Second, we will discuss which mechanisms for action selection exist, where we
will encounter the exploration/exploitation trade-off.
• Third, we will discuss how to learn from the rewards of the selected actions.
• Finally, we will discuss two full algorithms in which all these concepts come
together: SARSA and Q-learning.
• Note the introduction of 𝛼, the learning rate, which controls how fast the
algorithm learns (bootstraps).
• It is an important parameter; setting the value too high can be detrimental since
the last value then dominates the bootstrap process too much.
• The temporal difference update rule is 𝑉(𝑠) ← 𝑉(𝑠) + 𝛼[𝑟 + 𝛾𝑉(𝑠′) − 𝑉(𝑠)]; the
last term −𝑉(𝑠) subtracts the value of the current state, to compute the
temporal difference.
• Another way to write this update rule is 𝑉(𝑠) ← 𝑉(𝑠) + 𝛼[𝑉target − 𝑉(𝑠)]: the update
is the difference between the new temporal difference target 𝑉target = 𝑟 + 𝛾𝑉(𝑠′)
and the old value, scaled by 𝛼.
• Note the absence of the transition model 𝑇 in the formula; temporal difference is a
model-free update formula.
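• A minimal sketch of the tabular temporal difference update, with the learning rate 𝛼 controlling how strongly the new target overrides the old value (the environment interaction is left abstract and the state names are made up):

    from collections import defaultdict

    ALPHA, GAMMA = 0.1, 0.99
    V = defaultdict(float)  # tabular value function, initialized to zero

    def td_update(s, r, s_next, done):
        # Model-free TD update: V(s) <- V(s) + alpha * (target - V(s)),
        # with target = r + gamma * V(s') and V(terminal) = 0.
        target = r + (0.0 if done else GAMMA * V[s_next])
        V[s] += ALPHA * (target - V[s])

    # Example transition sampled from the environment (values made up):
    td_update(s="s0", r=0.0, s_next="s1", done=False)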
Bias-Variance Trade-off
• A crucial difference between the Monte Carlo method and the temporal
difference method is the use of bootstrapping to calculate the value function.
• The use of bootstrapping has an important consequence: it trades off bias and
variance.
• Monte Carlo does not use bootstrapping. It performs a full episode with many
random action choices before it uses the reward. As such, its action choices are
unbiased (they are fully random); they are not influenced by previous reward
values.
• However, the fully random choices also cause a high variance of returns between
episodes.
• Monte Carlo is a low-bias/high-variance algorithm.
• In contrast, temporal difference bootstraps the 𝑄-function with the values of the
previous steps, refining the function values with the rewards after each single
step.
• Thus, because of bootstrapping, TD is a high-bias/low-variance method.
• Both approaches have their uses in different circumstances. In fact, we can think
of situations where a middle ground (of medium bias/medium variance) might be
useful.
• This is the idea behind the so-called n-step approach: do not sample a full
episode, and also not a single step, but sample a few steps at a time before using
the reward values (a small sketch of the n-step target follows after this list).
• The n-step algorithm has medium bias and medium variance.
• In the value-based approach we know the value functions 𝑉(𝑠) or 𝑄(𝑠, 𝑎). How
can that help us to find action 𝑎?
• In a discrete action space, there is at least one discrete action with the highest
value. Thus, if we have the optimal values, the optimal policy can be found by
selecting, in each state, the action with the highest value.
• For the state-action value 𝑄★ this relationship is given by 𝜋★(𝑠) = arg max𝑎 𝑄★(𝑠, 𝑎).
• In this way the optimal policy 𝜋★(𝑠), the sequence of best actions, can be recovered
from the values, hence the name value-based methods.
• A full reinforcement learning algorithm consists of a rule for the selection part
(downward) and a rule for the learning part (upward).
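• A small sketch of the n-step target mentioned above, which interpolates between the one-step temporal difference target and the full Monte Carlo return (the values are made up):

    # n-step target: use n sampled rewards, then bootstrap from the value estimate
    # of the state reached after n steps.
    def n_step_target(rewards, v_bootstrap, gamma=0.99):
        target = 0.0
        for k, r in enumerate(rewards):
            target += (gamma ** k) * r
        return target + (gamma ** len(rewards)) * v_bootstrap

    # n = 1 gives the one-step temporal difference target; a full episode with
    # v_bootstrap = 0 gives the Monte Carlo return.
    print(n_step_target([0.0, 0.0, 1.0], v_bootstrap=0.5))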
Exploration
Bandit Theory
𝝐-greedy Exploration
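• A minimal sketch of 𝜖-greedy action selection over a tabular Q-function (the Q-table is a made-up example):

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # With probability epsilon explore a random action,
        # otherwise exploit the greedy (highest-valued) action.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
    a = epsilon_greedy(Q, "s0", ["left", "right"])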
Off-Policy Learning
• The question is whether the agent should perform updates strictly on-policy,
only learning from its most recent action, or allow off-policy updates, learning
from all available information.
• In on-policy learning, the learning takes place by using the value of the action
that was selected by the policy. The policy determines the action to take, and the
value of that action is used to update the value of the policy function: the
learning is on-policy.
• In off-policy methods, the learning takes place by backing up values of another
action, not necessarily the one selected by the behavior policy.
• This method makes sense when the agent explores.
• When the behavior policy explores, it selects a non-optimal action.
• The difference between on-policy and off-policy learning only appears when an
exploratory (non-greedy) action is selected.
• In the case of exploration, off-policy learning can be more efficient: instead of
stubbornly backing up the value of the action selected by the behavior policy, it
backs up the value of an older, better action.
• An important point is that the convergence behavior of on-policy and off-policy
learning is different.
• In general, tabular reinforcement learning algorithms have been proven to converge
when the policy is greedy in the limit with infinite exploration.
• Off-policy methods learn from the greedy rewards and thus converge to the
optimal policy, after having sampled enough states.
• However, on-policy methods with a fixed 𝜖 do not converge to the optimal policy,
since they keep selecting explorative actions.
• When we use a variable-𝜖-policy in which the value of 𝜖 goes to zero, then on-
policy methods do converge, since then they choose, in the limit, the greedy
action.
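• A tiny sketch of a variable-𝜖 schedule in which 𝜖 goes to zero over time, so that in the limit the greedy action is chosen (the particular schedule is an arbitrary illustration):

    # Epsilon decays towards zero as the number of steps grows.
    def epsilon_at(step, eps_start=1.0):
        return eps_start / (1.0 + 0.01 * step)

    print(epsilon_at(0))      # 1.0: mostly exploration at the start
    print(epsilon_at(10000))  # close to 0: almost always the greedy action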
On-Policy SARSA
• On-policy learning selects an action, evaluates it in the environment, and follows
that action, guided by the behavior policy.
• On-policy learning samples the state space following the behavior policy, and
improves the policy by backing up values of the selected actions.
• SARSA updates its Q-values using the Q-value of the next state 𝑠𝑡+1 and the
action 𝑎𝑡+1 selected by the current behavior policy.
• The primary advantage of on-policy learning is its predictable behavior: it learns
the value of the policy it actually follows.
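• A minimal sketch of the on-policy SARSA update, which backs up the value of the action actually selected by the behavior policy in the next state (environment interaction left abstract; states and actions are made up):

    from collections import defaultdict

    ALPHA, GAMMA = 0.1, 1.0
    Q = defaultdict(float)

    def sarsa_update(s, a, r, s_next, a_next, done):
        # On-policy update: bootstrap from Q(s', a'), where a' is the action
        # the behavior policy actually selected in s'.
        target = r + (0.0 if done else GAMMA * Q[(s_next, a_next)])
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

    # Example tuple (s, a, r, s', a') sampled while following the behavior policy:
    sarsa_update("s0", "right", 0.0, "s1", "left", done=False)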
Off-Policy Q-Learning
• Off-policy learning is more complicated; it may learn its policy from actions that
are different from the one just taken.
• The only difference from on-policy learning is that the 𝛾𝑄(𝑠𝑡+1, 𝑎𝑡+1) term from
Eq. 2.9 has been replaced by 𝛾 max𝑎 𝑄(𝑠𝑡+1, 𝑎) (a small sketch of this update
follows at the end of this list).
• Indeed, the term temporal difference learning is sometimes used for the Q-
learning algorithm. Note that Q-learning uses bootstrapping.
• The reason that Q-learning is called off-policy is that it updates its Q-values
using the Q-value of the next state 𝑠𝑡+1 and the greedy action (not necessarily
the behavior policy's action; it is learning off the behavior policy).
• Environments in which a reward exists in each state are said to have a dense
reward structure.
• For other environments rewards may exist for only some of the states. For
example, in chess, rewards only exist at terminal board positions where there is a
win or a draw. In all other states the return depends on the future states and
must be calculated by the agent by propagating reward values from future states
up towards the root state 𝑠0. Such an environment is said to have a sparse
reward structure.
• Finding a good policy is more complicated when the reward structure is sparse.
• A graph of the landscape of such a sparse reward function would show a flat
landscape with a few sharp mountain peaks.
• Finding the optimum in a flat landscape where the gradient is zero, is hard.
• Reward shaping can make all the difference when no solution can be found with
a naive reward function.
• It is a way of incorporating heuristic knowledge into the MDP. The use of
heuristics on board games such as chess and checkers can also be regarded as
reward shaping.
• A discount factor close to 1 and a learning rate close to 0 are usually good
starting choices.
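• A minimal sketch of the off-policy Q-learning update referred to above, which bootstraps from the greedy action in the next state (environment interaction left abstract; states and actions are made up):

    from collections import defaultdict

    ALPHA, GAMMA = 0.1, 1.0
    Q = defaultdict(float)

    def q_learning_update(s, a, r, s_next, actions, done):
        # Off-policy update: bootstrap from the greedy action max_a Q(s', a),
        # not necessarily the action the behavior policy takes next.
        best_next = 0.0 if done else max(Q[(s_next, a_next)] for a_next in actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

    # Example transition sampled with an epsilon-greedy behavior policy (values made up):
    q_learning_update("s0", "right", 0.0, "s1", ["left", "right"], done=False)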
Conclusion
• Model-free methods use actions that are irreversible for the agent.
• The backup rule for learning is based on bootstrapping, and can follow the
rewards of the actions on-policy, including the value of the occasional
explorative action, or off-policy, always using the value of the best action.
Summary
• Games and robotics are two important fields of application. Problems can
be episodic (they end, such as a game of chess) or continuous (they do not
end; a robot remains in the world).
• In continuous problems it often makes sense to discount rewards that are far
from the present; episodic problems typically do not bother with a discount
factor: a win is a win.
• Environments can be deterministic (many board games are deterministic—
boards don’t move) or stochastic (many robotic worlds are stochastic—the
world around a robot moves).
• The optimal policy can be found by finding the maximal value of a state. The
value function 𝑉(𝑠) returns the expected cumulative reward for a state.
• When the transition function 𝑇𝑎 (𝑠, 𝑠′) is present, the agent can use Bellman’s
equation, or a dynamic programming method to recursively traverse the behavior
space. Value iteration is one such dynamic programming method.
• Value iteration traverses all actions of all states, backing up reward values, until
the value function stops changing. The state-action value 𝑄(𝑠, 𝑎) determines the
value of an action of a state.
• Bellman’s equation calculates the value of a state by calculating the value of
successor states. Accessing successor states (by following the action and
transition) is also called expanding a successor state.
• In a tree diagram successor states are called child nodes, and expanding is a
downward action. Backpropagating the reward values to the parent node is a
movement upward in the tree.
• Methods where the agent makes use of the transition model are called model-
based methods. When the agent does not use the transition model, the methods
are model-free.
• Value-based model-free methods can find an optimal policy by using only
irreversible actions, sampling the environment to find the value of the actions.
• A major determinant in model-free reinforcement learning is the
exploration/exploitation trade-off, or how much of the information that has been
learned from the environment is used in choosing actions to sample.
• A well-known exploration/exploitation method is 𝜖-greedy, where the greedy
(best) action of the behavior policy is followed, except in a fraction 𝜖 of the
cases, when a random exploratory action is performed. Always following the
policy's best action runs the risk of getting stuck in a cycle; exploring random
nodes allows breaking free of such cycles.
• How should we process the rewards that are found at nodes? Here we
introduced another fundamental element of reinforcement learning:
bootstrapping, or finding a value by refining a previous value. Temporal difference
learning uses the principle of bootstrapping to find the value of a state by adding
appropriately discounted future reward values to the state value function.
Book Questions
• In reinforcement learning the agent can choose which training examples are
generated. Why is this beneficial? What is a potential problem?
• Which five elements does an MDP have to model reinforcement learning problems?
• In a tree diagram, does action selection move up or down the tree? Down.
• In a tree diagram, does reward learning (backing up rewards) move up or down the tree? Up.
• What is 𝜏?
• What is 𝜋(𝑠)?
• What is 𝑉(𝑠)?
• What is 𝑄(𝑠, 𝑎)? The action-value function that estimates the value of taking a specific action in a state.
• What is dynamic programming? A method for solving complex problems by
breaking them down into simpler subproblems.
• What is recursion?
The process of solving a problem where the solution involves solving smaller
instances of the same problem.
• Which dynamic programming method can be used to determine the value of a state? Value iteration.
• Is the action space of games typically discrete or continuous? Discrete.
• Is the action space of robots typically discrete or continuous? Continuous.
• Is the environment of games typically deterministic or stochastic? Deterministic.
• Is the environment of robots typically deterministic or stochastic? Stochastic.
• What type of action space and what type of environment are suited for value-based
methods?
• Why are value-based methods used for games and not for robotics?