Module 02
Online Learning
Reinforcement Learning : Module 02
ε-Greedy : 10-Armed Bandit Performance
Figure : Average performance of ε-greedy action-value methods on the 10-armed testbed. These data are averages over 2000 runs with different bandit problems. All methods used sample averages as their action-value estimates.
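The setup behind this figure can be reproduced with a short simulation. The following is a minimal sketch, assuming the standard 10-armed testbed (true values q*(a) drawn from a standard normal, rewards drawn from a unit-variance normal around q*(a)); the function and variable names are illustrative, not from the slides.

import numpy as np

def run_bandit(eps, steps=1000, k=10, rng=None):
    """One run of eps-greedy with sample-average estimates on a 10-armed testbed."""
    rng = rng or np.random.default_rng()
    q_true = rng.normal(0.0, 1.0, k)            # true action values q*(a)
    q_est = np.zeros(k)                          # sample-average estimates Q(a)
    counts = np.zeros(k)                         # N(a): times each action was taken
    rewards = np.zeros(steps)
    for t in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))             # explore: random action
        else:
            a = int(np.argmax(q_est))            # exploit: greedy action
        r = rng.normal(q_true[a], 1.0)           # reward ~ N(q*(a), 1)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]   # incremental sample average
        rewards[t] = r
    return rewards

# Average over many independent bandit problems, as in the figure (2000 runs).
avg = np.mean([run_bandit(eps=0.1) for _ in range(2000)], axis=0)
print("eps = 0.1: average reward over the last 100 steps:", round(avg[-100:].mean(), 3))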
One More Time : Armed Bandit Example
• 3-restaurant example: average happiness of 10, 8 and 5, with 300 days to visit.
• Explore only (visit each restaurant 100 times):
• Average happiness: 100×10 + 100×8 + 100×5 = 2300; regret: 3000 − 2300 = 700.
• Exploit only (try each restaurant once, then stick with the apparent best):
• Day 1: R1 gives 5, Day 2: R2 gives 8, Day 3: R3 gives 5.
• So we continue with R2 for the remaining 297 days, even though R1 is actually best: happiness ≈ 18 + 297×8 = 2394, regret ≈ 600.
• ε-greedy with ε = 10%: explore on about 30 days (≈10 visits per restaurant, ≈230 happiness) and exploit the best-looking restaurant on the remaining ≈270 days; if exploration correctly identifies R1, happiness ≈ 2930 and regret ≈ 70 (a simulation of this example is sketched below).
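A minimal simulation of this restaurant example, assuming happiness on each visit is drawn from a normal distribution around the stated averages (10, 8, 5); the noise level, seed, and function names are illustrative assumptions.

import numpy as np

MEANS = np.array([10.0, 8.0, 5.0])   # average happiness of the 3 restaurants
DAYS = 300
rng = np.random.default_rng(0)

def visit(r):
    """Happiness from one visit to restaurant r (assumed noisy around its mean)."""
    return rng.normal(MEANS[r], 1.0)

def explore_only():
    # Rotate through all three restaurants for the full 300 days.
    return sum(visit(d % 3) for d in range(DAYS))

def exploit_only():
    # Try each restaurant once, then stick with whichever looked best.
    first = [visit(r) for r in range(3)]
    best = int(np.argmax(first))
    return sum(first) + sum(visit(best) for _ in range(DAYS - 3))

def eps_greedy(eps=0.1):
    est, n, total = np.zeros(3), np.zeros(3), 0.0
    for _ in range(DAYS):
        r = int(rng.integers(3)) if rng.random() < eps else int(np.argmax(est))
        h = visit(r)
        n[r] += 1
        est[r] += (h - est[r]) / n[r]   # sample-average estimate of each restaurant
        total += h
    return total

for name, strategy in [("explore only", explore_only),
                       ("exploit only", exploit_only),
                       ("eps-greedy 10%", eps_greedy)]:
    print(name, "-> total happiness", round(strategy(), 1))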
Bandit Example
• Consider a k-armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using ε-greedy action selection, sample-average action-value estimates, and initial estimates of Q1(a) = 0, for all a. Suppose the initial sequence of actions and rewards is A1 = 1, R1 = −1, A2 = 2, R2 = 1, A3 = 2, R3 = −2, A4 = 2, R4 = 2, A5 = 3, R5 = 0.
• On some of these time steps the ε case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred? (A short script that replays this sequence is sketched below.)
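A short script that replays the given action-reward sequence with sample-average estimates and reports, at each step, whether the selected action was among the greedy ones; if it was not, the ε case must have occurred. This is an illustrative aid for working the exercise, not an official solution.

import numpy as np

actions = [1, 2, 2, 2, 3]          # A1..A5
rewards = [-1, 1, -2, 2, 0]        # R1..R5

q = np.zeros(5)                     # Q(a) for a = 1..4 (index 0 unused)
n = np.zeros(5)                     # N(a): selection counts

for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy = np.flatnonzero(q[1:] == q[1:].max()) + 1          # current greedy actions
    verdict = ("greedy or random (ε possible)" if a in greedy
               else "not greedy (ε definitely occurred)")
    print(f"step {t}: chose {a}, greedy set {greedy.tolist()} -> {verdict}")
    n[a] += 1
    q[a] += (r - q[a]) / n[a]       # sample-average update after observing R_t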
• The convergence conditions referred to here are the standard stochastic-approximation conditions on the step-size sequence: Σn αn(a) = ∞ and Σn αn(a)² < ∞.
• The first condition is required to guarantee that the steps are large enough to eventually overcome any initial conditions or random fluctuations.
• The second condition guarantees that eventually the steps become small enough to assure convergence.
Tracking a Nonstationary Problem
• Note that both convergence conditions are met for the sample-
average case αn(a) = 1/n.
• However, for the case of a constant step-size parameter, αn(a) = α, the second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards.
• This is actually desirable in a nonstationary environment, and
problems that are effectively nonstationary are the most common in
reinforcement learning.
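The sketch below illustrates this point under stated assumptions: a single nonstationary arm whose true value drifts by a small random walk each step, tracked once with sample averages (αn = 1/n) and once with a constant step size (α = 0.1). The drift size, noise level, and variable names are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
steps = 10_000
q_true = 0.0                    # true value, drifting over time (nonstationary)
q_avg, n = 0.0, 0               # sample-average estimate, alpha_n = 1/n
q_const, alpha = 0.0, 0.1       # constant-step-size estimate

err_avg = err_const = 0.0
for t in range(steps):
    q_true += rng.normal(0.0, 0.01)        # random-walk drift of the true value
    r = rng.normal(q_true, 1.0)            # noisy reward around the current true value
    n += 1
    q_avg += (r - q_avg) / n               # 1/n step: weights all past rewards equally
    q_const += alpha * (r - q_const)       # constant alpha: recency-weighted average
    if t >= steps // 2:                    # measure tracking error over the second half
        err_avg += abs(q_avg - q_true)
        err_const += abs(q_const - q_true)

print("mean |error|, sample average (1/n):", round(err_avg / (steps // 2), 3))
print("mean |error|, constant alpha = 0.1:", round(err_const / (steps // 2), 3))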
Tracking a Nonstationary Problem
• Also, sequences of step-size parameters that meet both conditions often converge very slowly or need considerable tuning in order to obtain a satisfactory convergence rate.
• Although sequences of step-size parameters that meet these
convergence conditions are often used in theoretical work, they are
seldom used in applications and empirical research.
Programming Exercise
Figure : The effect of optimistic initial action-value estimates on the 10-armed testbed. Both methods used a constant step-size parameter, α = 0.1.
Optimistic Initial Values
• Figure shows the performance on the 10-armed bandit testbed of a greedy
method using Q1(a) = +5, for all a.
• For comparison, also shown is an ε-greedy method with Q1(a) = 0.
• Initially, the optimistic method performs worse because it explores more, but eventually it performs better because its exploration decreases with time (a sketch reproducing this comparison follows this list).
• We call this technique for encouraging exploration optimistic initial values.
• We regard it as a simple trick that can be quite effective on stationary
problems, but it is far from being a generally useful approach to
encouraging exploration.
• For example, it is not well suited to nonstationary problems, because its drive for exploration is inherently temporary.
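A minimal sketch of the comparison shown in the figure, assuming the standard 10-armed testbed and a constant step size α = 0.1: a purely greedy agent with optimistic initial estimates Q1(a) = +5 versus an ε-greedy agent (ε = 0.1) with Q1(a) = 0. Run counts and names are illustrative.

import numpy as np

def run(q_init, eps, steps=1000, k=10, alpha=0.1, rng=None):
    """Track, for one bandit problem, whether the optimal action was chosen each step."""
    rng = rng or np.random.default_rng()
    q_true = rng.normal(0.0, 1.0, k)
    best = int(np.argmax(q_true))
    q = np.full(k, q_init, dtype=float)          # optimistic (+5) or realistic (0) initial estimates
    hits = np.zeros(steps)
    for t in range(steps):
        a = int(rng.integers(k)) if rng.random() < eps else int(np.argmax(q))
        r = rng.normal(q_true[a], 1.0)
        q[a] += alpha * (r - q[a])               # constant-step-size update
        hits[t] = (a == best)
    return hits

runs = 500
optimistic = np.mean([run(q_init=5.0, eps=0.0) for _ in range(runs)], axis=0)
realistic = np.mean([run(q_init=0.0, eps=0.1) for _ in range(runs)], axis=0)
print("optimal-action fraction over the last 200 steps:")
print("  greedy, Q1 = +5  :", round(optimistic[-200:].mean(), 2))
print("  eps-greedy, Q1 = 0:", round(realistic[-200:].mean(), 2))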
Upper-Confidence-Bound Action
Selection
Upper-Confidence-Bound : Need
• Exploration is needed because there is always uncertainty about the
accuracy of the action-value estimates.
• The greedy actions are those that look best at present, but some of the
other actions may actually be better.
• ε-greedy action selection forces the non-greedy actions to be tried, but
indiscriminately, with no preference for those that are nearly greedy or
particularly uncertain.
• It would be better to select among the non-greedy actions according to
their potential for actually being optimal, taking into account both how
close their estimates are to being maximal and the uncertainties in those
estimates.
Upper-Confidence-Bound : Formula
• One effective way of doing this is to select actions according to the upper-confidence-bound (UCB) rule:
At = argmax_a [ Qt(a) + c · sqrt( ln t / Nt(a) ) ]
• where
• ln t denotes the natural logarithm of t (the number that e ≈ 2.71828 would have to be raised to in order to equal t),
• Nt(a) denotes the number of times that action a has been selected prior to time t, and
• the number c > 0 controls the degree of exploration.
• If Nt(a) = 0, then a is considered to be a maximizing action.
Upper-Confidence-Bound Action Selection
• The square-root term is a measure of the uncertainty or variance in the estimate
of a’s value.
• The quantity being max’ed over is thus a sort of upper bound on the possible true
value of action a, with c determining the confidence level.
• Each time a is selected the uncertainty is presumably reduced: Nt(a) increments, and, as it appears in the denominator, the uncertainty term decreases.
• On the other hand, each time an action other than a is selected, t increases but Nt(a) does not; because t appears in the numerator, the uncertainty estimate increases.
• The use of the natural logarithm means that the increases get smaller over time,
but are unbounded.
• All actions will eventually be selected, but actions with lower value estimates, or
that have already been selected frequently, will be selected with decreasing
frequency over time.
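A sketch of UCB action selection with sample-average estimates on the assumed 10-armed testbed; as-yet-untried actions (Nt(a) = 0) are treated as maximizing and chosen at random among themselves, as in the figure caption that follows. The value c = 2 and the run counts are illustrative.

import numpy as np

def ucb_run(c=2.0, steps=1000, k=10, rng=None):
    rng = rng or np.random.default_rng()
    q_true = rng.normal(0.0, 1.0, k)
    q, n = np.zeros(k), np.zeros(k)
    rewards = np.zeros(steps)
    for t in range(1, steps + 1):
        untried = np.flatnonzero(n == 0)
        if untried.size > 0:
            a = int(rng.choice(untried))              # N_t(a) = 0: maximizing by definition
        else:
            ucb = q + c * np.sqrt(np.log(t) / n)      # Q_t(a) + c * sqrt(ln t / N_t(a))
            a = int(np.argmax(ucb))
        r = rng.normal(q_true[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]                     # sample-average update
        rewards[t - 1] = r
    return rewards

avg = np.mean([ucb_run() for _ in range(2000)], axis=0)
print("UCB (c = 2): average reward over the last 100 steps:", round(avg[-100:].mean(), 3))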
Upper-Confidence-Bound : Performance
Figure : Average performance of UCB action selection on the 10-armed testbed. As shown, UCB generally performs better than ε-greedy action selection, except in the first k steps, when it selects randomly among the as-yet-untried actions.
Upper-Confidence-Bound : Comments
• UCB often performs well, as shown in the figure, but is more difficult
than ε-greedy to extend beyond bandits to the more general
reinforcement learning settings.
• Another difficulty is dealing with large state spaces, particularly when
using function approximation.
• In these more advanced settings the idea of UCB action selection is
usually not practical.
Gradient Bandit Algorithms
Gradient : The gradient provides information about the rate of change or slope of a
function at a particular point. For a function of two variables, the gradient points in
the direction of the steepest ascent on the surface defined by that function.
Gradient Bandit Algorithms
• So far we have considered methods that estimate action values and
use those estimates to select actions. This is often a good approach,
but it is not the only one possible.
• Let us consider learning a numerical preference for each action a,
which we denote Ht(a).
• The larger the preference, the more often that action is taken, but the
preference has no interpretation in terms of reward.
Gradient Bandit Algorithms
• Only the relative preference of one action over another is important; if we add 1000 to all the action preferences there is no effect on the action probabilities, which are determined according to a soft-max distribution as follows:
πt(a) = Pr{At = a} = exp(Ht(a)) / Σb exp(Ht(b))
(A sketch of the full gradient bandit algorithm is given below.)
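A sketch of the gradient bandit algorithm under its usual formulation: preferences are updated by stochastic gradient ascent, Ht+1(a) = Ht(a) + α (Rt − R̄t)(1{a = At} − πt(a)), where the baseline R̄t is the running average of the rewards received so far. The shift of the true values to a mean of +4 mirrors the figure; the step size and run count are illustrative.

import numpy as np

def gradient_bandit(alpha=0.1, use_baseline=True, steps=1000, k=10, rng=None):
    rng = rng or np.random.default_rng()
    q_true = rng.normal(4.0, 1.0, k)        # true values near +4, as in the figure
    best = int(np.argmax(q_true))
    h = np.zeros(k)                          # action preferences H_t(a)
    avg_reward = 0.0                         # baseline: running average of rewards
    hits = np.zeros(steps)
    for t in range(1, steps + 1):
        pi = np.exp(h - h.max())             # soft-max probabilities (shifted for stability)
        pi /= pi.sum()
        a = int(rng.choice(k, p=pi))
        r = rng.normal(q_true[a], 1.0)
        avg_reward += (r - avg_reward) / t
        baseline = avg_reward if use_baseline else 0.0
        one_hot = np.zeros(k)
        one_hot[a] = 1.0
        h += alpha * (r - baseline) * (one_hot - pi)   # gradient ascent on preferences
        hits[t - 1] = (a == best)
    return hits

runs = 500
with_b = np.mean([gradient_bandit(use_baseline=True) for _ in range(runs)], axis=0)
without_b = np.mean([gradient_bandit(use_baseline=False) for _ in range(runs)], axis=0)
print("optimal-action fraction over the last 200 steps:")
print("  with baseline   :", round(with_b[-200:].mean(), 2))
print("  without baseline:", round(without_b[-200:].mean(), 2))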
Figure : Average performance of the gradient bandit algorithm with and without a reward baseline on the 10-armed testbed when the q*(a) are chosen to be near +4 rather than near zero.
Gradient Bandit Algorithms Performance
• Figure shows results with the gradient bandit algorithm on a variant
of the 10-armed testbed in which the true expected rewards were
selected according to a normal distribution with a mean of +4 instead
of zero (and with unit variance as before).
• This shifting up of all the rewards has absolutely no effect on the
gradient bandit algorithm because of the reward baseline term, which
instantaneously adapts to the new level.
• But if the baseline were omitted, then performance would be
significantly degraded, as shown in the figure.
End