
UNIT-1

Chapter-1: Introduction

1.1. Elements of Reinforcement learning


What RL is about:

 trial and error search, delayed reward


 exploration vs exploitation tradeoff

Main elements: agent and environment

Fig 1.1. Agent and Environment interactions


Sub elements:

 policy π, mapping from states to actions. May be stochastic


 reward signal: given by the environment at each time step
 value function of a state V(s): total amount of reward an agent can expect to accumulate
over the future, starting from that state
 [optional] a model of the environment: allows inferences to be made about how the
environment will behave

1.2. Limitations and scope


 the design of the state signal is not addressed; the state representation is taken as given
 the focus is on methods that estimate value functions; evolutionary methods, which search the policy space directly, are not covered

1.3. An Extended Example: Tic-Tac-Toe


To illustrate the general idea of reinforcement learning and contrast it with other
approaches, we next consider a single example in more detail.

Consider the familiar child's game of tic-tac-toe. Two players take turns playing on a
three-by-three board. One player plays Xs and the other Os until one player wins by
placing three marks in a row, horizontally, vertically, or diagonally.
If the board fills up with neither player getting three in a row, the game is a draw.
Because a skilled player can play so as never to lose, let us assume that we are
playing against an imperfect player, one whose play is sometimes incorrect and
allows us to win. For the moment, in fact, let us consider draws and losses to be
equally bad for us. How might we construct a player that will find the imperfections
in its opponent's play and learn to maximize its chances of winning?

Although this is a simple problem, it cannot readily be solved in a satisfactory way


through classical techniques. For example, the classical "minimax" solution from
game theory is not correct here because it assumes a particular way of playing by the
opponent. For example, a minimax player would never reach a game state from which
it could lose, even if in fact it always won from that state because of incorrect play by
the opponent. Classical optimization methods for sequential decision problems, such
as dynamic programming, can compute an optimal solution for any opponent, but
require as input a complete specification of that opponent, including the probabilities
with which the opponent makes each move in each board state. Let us assume that this
information is not available a priori for this problem, as it is not for the vast majority
of problems of practical interest. On the other hand, such information can be
estimated from experience, in this case by playing many games against the opponent.
About the best one can do on this problem is first to learn a model of the opponent's
behavior, up to some level of confidence, and then apply dynamic programming to
compute an optimal solution given the approximate opponent model. In the end, this
is not that different from some of the reinforcement learning methods.

An evolutionary approach to this problem would directly search the space of possible
policies for one with a high probability of winning against the opponent. Here, a
policy is a rule that tells the player what move to make for every state of the game--
every possible configuration of Xs and Os on the three-by-three board. For each
policy considered, an estimate of its winning probability would be obtained by
playing some number of games against the opponent. This evaluation would then
direct which policy or policies were considered next. A typical evolutionary method
would hill-climb in policy space, successively generating and evaluating policies in an
attempt to obtain incremental improvements. Or, perhaps, a genetic-style algorithm
could be used that would maintain and evaluate a population of policies. Literally
hundreds of different optimization methods could be applied. By directly searching
the policy space we mean that entire policies are proposed and compared on the basis
of scalar evaluations.

Here is how the tic-tac-toe problem would be approached using reinforcement


learning and approximate value functions. First we set up a table of numbers, one for
each possible state of the game. Each number will be the latest estimate of the
probability of our winning from that state. We treat this estimate as the state's value,
and the whole table is the learned value function. State A has higher value than state
B, or is considered "better" than state B, if the current estimate of the probability of
our winning from A is higher than it is from B. Assuming we always play Xs, then
for all states with three Xs in a row the probability of winning is 1, because we have
already won. Similarly, for all states with three Os in a row, or that are "filled up," the
correct probability is 0, as we cannot win from them. We set the initial values of all
the other states to 0.5, representing a guess that we have a 50% chance of winning.

We play many games against the opponent. To select our moves we examine the
states that would result from each of our possible moves (one for each blank space on
the board) and look up their current values in the table. Most of the time we
move greedily, selecting the move that leads to the state with greatest value, that is,
with the highest estimated probability of winning. Occasionally, however, we select
randomly from among the other moves instead. These are called exploratory moves
because they cause us to experience states that we might otherwise never see. A
sequence of moves made and considered during a game can be diagrammed as in
Figure 1.1.
Figure 1.1: A sequence of tic-tac-toe moves. The solid lines represent the moves
taken during a game; the dashed lines represent moves that we (our
reinforcement learning player) considered but did not make. Our second move
was an exploratory move, meaning that it was taken even though another sibling
move was ranked higher. Exploratory moves do not result
in any learning, but each of our other moves does, causing backups as suggested
by the curved arrows and detailed in the text.

While we are playing, we change the values of the states in which we find ourselves
during the game. We attempt to make them more accurate estimates of the
probabilities of winning. To do this, we "back up" the value of the state after each
greedy move to the state before the move, as suggested by the arrows in Figure 1.1.
More precisely, the current value of the earlier state is adjusted to be closer to the
value of the later state. This can be done by moving the earlier state's value a fraction
of the way toward the value of the later state. If we let St denote the state before the
greedy move, and St+1 the state after the move, then the update to the estimated value
of St, denoted V(St), can be written as

V(St) ← V(St) + α[V(St+1) − V(St)]

where α is a small positive fraction called the step-size parameter, which influences
the rate of learning. This update rule is an example of a temporal-difference learning
method, so called because its changes are based on a difference, V(St+1) − V(St),
between estimates at two different times.
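A minimal sketch of this backup rule in Python, assuming states are stored as hashable board encodings; the table name, the 0.5 default, and the step-size value are illustrative assumptions, not part of the text:

```python
# Minimal sketch of the temporal-difference value backup described above.
ALPHA = 0.1          # step-size parameter (a small positive fraction)
values = {}          # state -> current estimate of the probability of winning

def V(state):
    """Current value estimate; unseen states default to the 0.5 initial guess."""
    return values.get(state, 0.5)

def td_backup(state_before, state_after):
    """Move V(state_before) a fraction ALPHA toward V(state_after)."""
    values[state_before] = V(state_before) + ALPHA * (V(state_after) - V(state_before))
```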
Chapter-2: Multi-Armed Bandits

2.1. A k-armed bandit problem


 extension of the bandit setting, inspired by slot machines that are sometimes
called "one-armed bandits". Here we have k levers (k arms).
 repeated choice among k actions
 after each choice we receive a reward (chosen from a stationary distribution that depends on
the action chosen)
 objective: maximize the total expected reward over some time period

Each of the k actions has a mean reward that we call the value of the action (q-value).

 At: action selected at time step t


 Rt: corresponding reward
 q∗(a): value of the action

q∗(a) ≐ E[Rt | At = a]
If we knew the values of each action it would be trivial to solve the problem. But we don’t have
them so we need estimates. We call Qt(a) the estimated value for action a at time t.

 greedy action: take the action with the highest current estimate. That’s exploitation
 nongreedy actions allow us to explore and build better estimates

2.2. Action-value methods


= methods for computing our Qt(a) estimates. One natural way to do that is to average the
rewards actually received (sample average):
Qt(a) ≐ (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
      = [Σ_{i=1}^{t−1} Ri · 1(Ai=a)] / [Σ_{i=1}^{t−1} 1(Ai=a)]
As the denominator goes to infinity, by the law of large numbers Qt(a) converges to q∗(a).
Now that we have an estimate, how do we use it to select actions?

Greedy action selection: At ≐ argmax_a Qt(a)


Alternative: ε-greedy action selection (behave greedily except for a small proportion ε of the
actions where we pick an action at random)
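A small sketch of greedy and ε-greedy selection, assuming the estimates Qt(a) are kept in a NumPy array; the function name and the default ε are illustrative:

```python
import numpy as np

rng = np.random.default_rng()

def select_action(Q, epsilon=0.1):
    """Epsilon-greedy selection over a length-k array of estimates Qt(a).
    With probability epsilon we explore; otherwise we exploit (pure greedy is epsilon=0)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # exploratory move: any action at random
    return int(np.argmax(Q))               # greedy move: highest current estimate
```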
2.3. The 10-armed testbed
Test setup: set of 2000 10-armed bandits in which all of the 10 action values are selected
according to a Gaussian with mean 0 and variance 1.
When testing a learning method, it selects an action At and the reward is selected from a
Gaussian with mean q∗(At) and variance 1.
TL;DR: ε-greedy clearly outperforms pure greedy over the long run.
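A sketch of how one bandit from this testbed could be generated and sampled (the function names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng()

def make_bandit(k=10):
    """Draw the true action values q*(a) from a Gaussian with mean 0 and variance 1."""
    return rng.normal(0.0, 1.0, size=k)

def pull(q_star, action):
    """Reward for the chosen action: Gaussian with mean q*(action) and variance 1."""
    return rng.normal(q_star[action], 1.0)
```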

2.4. Incremental implementation


How can the computation of the Qt(a) estimates be done in a computationally efficient manner?
For a single action, the estimate Qn of this action's value after it has been selected n−1 times is:
Qn = (R1 + R2 + ... + Rn−1) / (n−1)
We’re not going to store all the values of R and recompute the sum at every time step, so we’re
going to use a better update rule, which is equivalent:
Qn+1 = Qn + (1/n)[Rn − Qn]
general form:

New Estimate ← Old Estimate + Step Size · [Target − Old Estimate]


The expression [Target − Old Estimate] is the error in the estimate.
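A minimal sketch of this incremental rule for a single action; variable names and the calling convention are assumptions:

```python
def update_sample_average(Q, N, reward):
    """Incremental sample average for one action.
    N becomes the number of times the action has now been selected (this one included),
    and Q is updated as Q_{n+1} = Q_n + (1/n)[R_n - Q_n], with no stored reward history."""
    N += 1
    Q += (reward - Q) / N
    return Q, N
```

A caller would keep one (Q, N) pair per action and reassign the returned values after every reward.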

2.5. Tracking a non-stationary problem


 Now the reward probabilities change over time
 We want to give more weight to the most recent rewards

One easy way to do it is to use a constant step size α, so that Qn+1 becomes a weighted average of past
rewards and the initial estimate Q1:
Qn+1 ≐ Qn + α[Rn − Qn]
Which is equivalent to:

Qn+1 = (1−α)^n · Q1 + Σ_{i=1}^{n} α(1−α)^(n−i) · Ri

 we call that a weighted average because the sum of the weights is 1


 the weight given to Ri depends on how many rewards ago (n−i) it was observed
 since (1−α)<1, the weight given to Ri decreases as the number of intervening rewards
increases
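A sketch of the constant step-size update described above (the value of α is an illustrative choice):

```python
def update_constant_step(Q, reward, alpha=0.1):
    """Constant step size: recent rewards get exponentially more weight than old ones,
    which is what we want when the reward distribution drifts over time."""
    return Q + alpha * (reward - Q)
```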
2.6. Optimistic initial values
 a way to encourage exploration, which is effective in stationary problems
 set the initial estimates to a high value (higher than any reward we can actually get in the
environment) so that unexplored actions look better than explored ones and are
selected more often
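A sketch of optimistic initialization on the 10-armed testbed, assuming an initial estimate of +5 (well above any reward the testbed can actually produce):

```python
import numpy as np

k = 10
Q = np.full(k, 5.0)            # optimistic initial estimates (true values ~ N(0, 1))
N = np.zeros(k, dtype=int)     # selection counts
# Even a purely greedy learner now tries every arm at least once, because each
# observed reward pulls the chosen arm's estimate down below the untried arms.
```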

2.7. Upper-Confidence-Bound (UCB) Action Selection

 another way to tackle the exploration problem
 ε-greedy methods choose non-greedy actions uniformly at random
 maybe there is a better way to select those actions according to their potential of being
optimal

At ≐ argmax_a [ Qt(a) + c·√( ln t / Nt(a) ) ]

 ln t increases at each time step


 Nt(a) is the number of times action a has been selected prior to time t, so it increases every
time a is selected
 c > 0 is our confidence level that controls the degree of exploration

The square-root term is a measure of the uncertainty or variance in the estimate of a’s value. The
quantity being max’ed over is thus a sort of upper bound on the possible true value of action a,
with c determining the confidence level.

 Each time a is selected the uncertainty is reduced (Nt(a) increments and the uncertainty
term decreases)
 Each time another action is selected, ln t increases but Nt(a) does not, so the uncertainty
term for a increases
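A sketch of UCB selection over arrays of estimates and counts; the handling of never-tried actions (treated as maximally uncertain) and the default c are assumptions:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """UCB selection: At = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ].
    Q and N are length-k arrays of estimates and selection counts; t >= 1."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:                   # never-tried actions are maximally uncertain
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```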
2.8. Gradient Bandit Algorithms
 another way to select actions
 so far we estimate action values (q-values) and use these to select actions
 here we compute a numerical preference for each action a, which we denote Ht(a); we select
actions according to a softmax distribution over the preferences, πt(a) = e^Ht(a) / Σ_b e^Ht(b)

Algorithm based on stochastic gradient ascent: on each step, after selecting action At and
receiving reward Rt, the action preferences are updated by:
Ht+1(At) ≐ Ht(At) + α(Rt − R̄t)(1 − πt(At))
and for all a ≠ At:
Ht+1(a) ≐ Ht(a) − α(Rt − R̄t)πt(a)
where α > 0 is the step size.

 R̄t is the average of all rewards up through and including time t

The R̄t term serves as a baseline with which the reward is compared. If the reward is higher than
the baseline, then the probability of taking At in the future is increased, and if the reward is
below the baseline, then the probability is decreased. The non-selected actions move in the
opposite direction.
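A sketch of the preference update and the softmax it relies on; the baseline is passed in as the running average reward, and all names are illustrative:

```python
import numpy as np

def softmax(H):
    """Action probabilities pi_t(a) from the preferences H_t(a)."""
    e = np.exp(H - H.max())                # subtract max for numerical stability
    return e / e.sum()

def gradient_bandit_update(H, action, reward, baseline, alpha=0.1):
    """Raise the chosen action's preference if the reward beats the baseline
    (average reward so far), lower it otherwise; other actions move the opposite way."""
    pi = softmax(H)
    H = H - alpha * (reward - baseline) * pi   # applies -alpha*(R - baseline)*pi(a) to every action
    H[action] += alpha * (reward - baseline)   # net effect on A_t: +alpha*(R - baseline)*(1 - pi(A_t))
    return H
```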

2.9. Associative search (contextual bandits)


 nonassociative tasks: no need to associate different actions with different situations
 extend the nonassociative bandit problem to the associative setting
 at each time step we may face a different bandit, signalled by some observable situation (context)
 learn a different policy for different bandits
 it opens a whole set of problems and we will see some answers in the next chapter
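A rough sketch of the associative setting, keeping one ε-greedy estimate table per observed context; the class and its interface are assumptions for illustration, not a method from the text:

```python
import numpy as np

class ContextualBandit:
    """One table of estimates per observed situation (context); inside each
    context we just run an ordinary epsilon-greedy sample-average learner."""

    def __init__(self, k, epsilon=0.1):
        self.k, self.epsilon = k, epsilon
        self.Q = {}                              # context -> length-k estimate array
        self.N = {}                              # context -> length-k selection counts
        self.rng = np.random.default_rng()

    def act(self, context):
        Q = self.Q.setdefault(context, np.zeros(self.k))
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.k))
        return int(np.argmax(Q))

    def learn(self, context, action, reward):
        N = self.N.setdefault(context, np.zeros(self.k, dtype=int))
        N[action] += 1
        Q = self.Q.setdefault(context, np.zeros(self.k))
        Q[action] += (reward - Q[action]) / N[action]
```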

2.10. Summary
 one key topic is balancing exploration and exploitation.
 Exploitation is straightforward: we select the action with the highest estimated value
(we exploit our current knowledge)
 Exploration is trickier, and we have seen several ways to deal with it:
o ε-greedy: explore by occasionally choosing a non-greedy action at random
o UCB: favor actions whose estimates are more uncertain
o gradient bandit: learn a numerical preference for each action instead of estimating action values
Which one is best (evaluated on the 10-armed testbed)?

Fig 2.1. Evaluation of several exploration methods on the 10-armed testbed
