Unit 1-RL
Chapter-1: Introduction
Consider the familiar child's game of tic-tac-toe. Two players take turns playing on a
three-by-three board. One player plays Xs and the other Os until one player wins by
placing three marks in a row, horizontally, vertically, or diagonally, as the X player
has in this game.
If the board fills up with neither player getting three in a row, the game is a draw.
Because a skilled player can play so as never to lose, let us assume that we are
playing against an imperfect player, one whose play is sometimes incorrect and
allows us to win. For the moment, in fact, let us consider draws and losses to be
equally bad for us. How might we construct a player that will find the imperfections
in its opponent's play and learn to maximize its chances of winning?
An evolutionary approach to this problem would directly search the space of possible
policies for one with a high probability of winning against the opponent. Here, a
policy is a rule that tells the player what move to make for every state of the game--
every possible configuration of X s and Os on the three-by-three board. For each
policy considered, an estimate of its winning probability would be obtained by
playing some number of games against the opponent. This evaluation would then
direct which policy or policies were considered next. A typical evolutionary method
would hill-climb in policy space, successively generating and evaluating policies in an
attempt to obtain incremental improvements. Or, perhaps, a genetic-style algorithm
could be used that would maintain and evaluate a population of policies. Literally
hundreds of different optimization methods could be applied. By directly searching
the policy space we mean that entire policies are proposed and compared on the basis
of scalar evaluations.
The approach using a value function is different: we set up a table of numbers, one for each possible state of the game, each number being our latest estimate of the probability of winning from that state. We then play many games against the opponent. To select our moves we examine the
states that would result from each of our possible moves (one for each blank space on
the board) and look up their current values in the table. Most of the time we
move greedily, selecting the move that leads to the state with greatest value, that is,
with the highest estimated probability of winning. Occasionally, however, we select
randomly from among the other moves instead. These are called exploratory moves
because they cause us to experience states that we might otherwise never see. A
sequence of moves made and considered during a game can be diagrammed as in
Figure 1.1.
Figure 1.1: A sequence of tic-tac-toe moves. The solid lines represent the moves
taken during a game; the dashed lines represent moves that we (our
reinforcement learning player) considered but did not make. Our second move
was an exploratory move, meaning that it was taken even though another sibling
move was ranked higher. Exploratory moves do not result
in any learning, but each of our other moves does, causing backups as suggested
by the curved arrows and detailed in the text.
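A minimal sketch of this greedy/exploratory move selection, assuming board positions are encoded as hashable objects and the value table is a plain Python dict (all names here are illustrative, not from the text):

import random

def select_state(afterstates, values, epsilon=0.1, default=0.5):
    """Choose which state to move to from the states reachable in one move.

    `values` maps a hashable board encoding to its estimated probability
    of winning; states we have never seen default to `default`.
    With probability `epsilon` we make an exploratory (random) choice;
    otherwise we move greedily to the highest-valued state.
    """
    if random.random() < epsilon:
        return random.choice(afterstates)                        # exploratory move
    return max(afterstates, key=lambda s: values.get(s, default))  # greedy move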
While we are playing, we change the values of the states in which we find ourselves
during the game. We attempt to make them more accurate estimates of the
probabilities of winning. To do this, we "back up" the value of the state after each
greedy move to the state before the move, as suggested by the arrows in Figure 1.1.
More precisely, the current value of the earlier state is adjusted to be closer to the
value of the later state. This can be done by moving the earlier state's value a fraction
of the way toward the value of the later state. If we let St denote the state before the
greedy move, and St+1 the state after the move, then the update to the estimated value
of St, denoted V(St), can be written as

V(St) ← V(St) + α[V(St+1) − V(St)]

where α is a small positive fraction called the step-size parameter, which influences
the rate of learning. This update rule is an example of a temporal-difference learning
method, so called because its changes are based on a difference, V(St+1) − V(St),
between estimates at two different times.
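A minimal sketch of this backup under the same dict-based value table as above (again, the names are illustrative):

def td_backup(values, state_before, state_after, alpha=0.1, default=0.5):
    """Temporal-difference backup after a greedy move:
        V(St) <- V(St) + alpha * [V(St+1) - V(St)]
    Unseen states are given the initial estimate `default`.
    """
    v_before = values.get(state_before, default)
    v_after = values.get(state_after, default)
    values[state_before] = v_before + alpha * (v_after - v_before)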
Chapter-2: Multi-Armed Bandits
Each of the k actions has a mean reward that we call the value of the action (q-value).
q∗(a)≐E[Rt|At=a]
If we knew the value of each action, it would be trivial to solve the problem. But we don't have
them, so we need estimates. We call Qt(a) the estimated value of action a at time t.
greedy action: take the action with the highest current estimate. That’s exploitation
nongreedy actions allow us to explore and build better estimates
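A rough sketch of the simplest way to trade these off, ε-greedy selection over the current estimates (the list-of-estimates layout and names are assumptions for illustration):

import random

def epsilon_greedy(Q, epsilon=0.1):
    """Return an action index: random (exploratory) with probability
    epsilon, otherwise the greedy action with the highest estimate."""
    if random.random() < epsilon:
        return random.randrange(len(Q))            # explore
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit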
One easy way to estimate the action values is to use a constant step size, so that Qn+1 becomes a
weighted average of the past rewards and the initial estimate Q1:

Qn+1 ≐ Qn + α[Rn − Qn]

which is equivalent to:

Qn+1 = (1−α)^n Q1 + Σ_{i=1}^{n} α(1−α)^{n−i} Ri
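A sketch of this constant-step-size update, assuming the estimates Q are kept in a list indexed by action (names are illustrative):

def update_estimate(Q, action, reward, alpha=0.1):
    """Q[a] <- Q[a] + alpha * (R - Q[a]).

    With a constant alpha, Q[a] becomes an exponentially weighted
    average of past rewards plus a decaying share of the initial estimate.
    """
    Q[action] += alpha * (reward - Q[action])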
Upper-Confidence-Bound (UCB) action selection:

At ≐ argmax_a [ Qt(a) + c √( ln t / Nt(a) ) ]
The square-root term is a measure of the uncertainty or variance in the estimate of a’s value. The
quantity being max’ed over is thus a sort of upper bound on the possible true value of action a,
with c determining the confidence level.
Each time a is selected, the uncertainty is reduced (Nt(a) increments and the uncertainty
term decreases).
Each time another action is selected, ln t increases but Nt(a) does not, so the uncertainty
increases.
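A minimal sketch of UCB selection under these definitions; treating never-tried actions (Nt(a) = 0) as maximally uncertain is spelled out explicitly here, and all names are illustrative:

import math

def ucb_select(Q, N, t, c=2.0):
    """Pick argmax_a [ Q[a] + c * sqrt(ln t / N[a]) ].

    Actions that have never been tried (N[a] == 0) are treated as
    maximally uncertain and selected first.
    """
    best_action, best_score = None, float("-inf")
    for a in range(len(Q)):
        if N[a] == 0:
            return a                                    # untried: maximal uncertainty
        score = Q[a] + c * math.sqrt(math.log(t) / N[a])
        if score > best_score:
            best_action, best_score = a, score
    return best_action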
2.8. Gradient Bandit Algorithms
another way to select actions
so far we estimate action values (q-values) and use these to select actions
here we compute a numerical preference for each action a, which we denote Ht(a), and we select
the action with a softmax (introducing the πt(a) notation):

πt(a) ≐ e^{Ht(a)} / Σ_b e^{Ht(b)}
Algorithm based on stochastic gradient ascent: on each step, after selecting action At and
receiving reward Rt, the action preferences are updated by:

Ht+1(At) ≐ Ht(At) + α(Rt − R̄t)(1 − πt(At))

and for all a ≠ At:

Ht+1(a) ≐ Ht(a) − α(Rt − R̄t)πt(a)

where α > 0 is the step size.
The R̄t term, the average of the rewards observed so far, serves as a baseline with which the
reward is compared. If the reward is higher than the baseline, then the probability of taking At
in the future is increased, and if the reward is below the baseline, then the probability is
decreased. The non-selected actions move in the opposite direction.
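A sketch of one step of this update, with the softmax from above and the baseline passed in as the running average reward (all names are illustrative, not a fixed API):

import math

def softmax(H):
    """pi_t(a) = exp(H[a]) / sum_b exp(H[b]), computed stably."""
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_step(H, action, reward, baseline, alpha=0.1):
    """After taking `action` and observing `reward`:
        H[At] <- H[At] + alpha * (R - baseline) * (1 - pi[At])
        H[a]  <- H[a]  - alpha * (R - baseline) * pi[a]      for a != At
    """
    pi = softmax(H)
    for a in range(len(H)):
        if a == action:
            H[a] += alpha * (reward - baseline) * (1 - pi[a])
        else:
            H[a] -= alpha * (reward - baseline) * pi[a]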
2.10. Summary
one key topic is balancing exploration and exploitation.
Exploitation is straightforward: we select the action with the highest estimated value
(we exploit our current knowledge)
Exploration is trickier, and we have seen several ways to deal with it:
o ε-greedy: occasionally choose an action at random
o UCB: favors the more uncertain actions
o gradient bandit: estimates a preference for each action instead of action values
Which one is best (evaluated on the 10-armed testbed)?