
EASWARI ENGINEERING COLLEGE

(AUTONOMOUS)
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND
DATA SCIENCE

191AIC601T – REINFORCEMENT LEARNING

Unit II - Notes

III YEAR - B.TECH

PREPARED BY: G.SIVASATHIYA, AP/AI&DS
APPROVED BY: HOD/AI&DS
UNIT – 2
MULTI-ARMED BANDITS AND MARKOV DECISION PROCESS

MULTI-ARMED BANDIT PROBLEM (MABP)

A bandit is someone who steals your money. A one-armed bandit is a simple slot machine: you insert a coin into the machine, pull a lever, and get an immediate reward. But why is it called a bandit? It turns out casinos configure these slot machines in such a way that, on average, gamblers end up losing money!

A multi-armed bandit is a more complicated slot machine: instead of one lever, there are several levers which a gambler can pull, with each lever giving a different return. The probability distribution of the reward corresponding to each lever is different and is unknown to the gambler.

The task is to identify which lever to pull in order to obtain the maximum total reward over a given number of trials. Each arm chosen is equivalent to an action, which then leads to an immediate reward.

Use Cases
Bandit algorithms are used in many research projects and in industry. Some of their use cases are listed in this section.

Clinical Trials

The well-being of patients during clinical trials is as important as the actual results of the study. Here, exploration is equivalent to identifying the best treatment, and exploitation is treating patients as effectively as possible during the trial.

Network Routing
Routing is the process of selecting a path for traffic in a network, such as a telephone network or a computer network (the internet). Allocating channels to the right users, such that the overall throughput is maximised, can be formulated as an MABP.
Online Advertising
The goal of an advertising campaign is to maximise revenue from
displaying ads. The advertiser makes revenue every time an offer is
clicked by a web user. Similar to MABP, there is a trade-off between
exploration, where the goal is to collect information on an ad’s
performance using click-through rates, and exploitation, where we stick
with the ad that has performed the best so far.

Game Design
Building a hit game is challenging. MABP can be used to test experimental changes in gameplay or interface and to exploit the changes that lead to positive player experiences.
EXPLORATION AND EXPLOITATION IN RL
Exploration
Exploration is oriented towards long-term benefit: it allows the agent to improve its knowledge about each action, which can lead to greater reward in the long run.

Exploitation
Exploitation uses the agent’s current value estimates and chooses the greedy action to get the most reward. However, because the agent is greedy with respect to the estimated values and not the actual values, it may well fail to get the most reward.

Let’s take an example to understand the exploration-exploitation trade-off properly.

Suppose you and your friend are digging in the hope of finding a diamond. Your friend gets lucky, finds a diamond before you, and walks off happily.

Seeing this, you get a bit greedy and think that you might also get lucky, so you start digging at the same spot as your friend. This action is called the greedy action, and the corresponding policy is called the greedy policy.

However, in this situation the greedy policy would fail, because a bigger diamond is buried at the spot where you were digging in the beginning. When your friend found the diamond, the only knowledge you gained was the depth at which that diamond was buried; you have no knowledge of what lies beyond that depth. In reality, the diamond may be where you were digging in the beginning, it may be where your friend was digging, or it may be in a completely different place.

With such partial knowledge about future states and future rewards, our reinforcement learning agent will be in a dilemma: should it exploit its partial knowledge to receive some reward, or should it explore unknown actions which could result in much larger rewards? On any single step, it cannot explore and exploit simultaneously.
ACTION-VALUE METHODS
We begin by looking more closely at some simple methods for estimating the values of actions and for using the estimates to make action selection decisions. In this chapter, we denote the true (actual) value of action a as q*(a), and its estimated value at the t-th play as Q_t(a).

Recall that the true value of an action is the mean reward received when that action is selected. One natural way to estimate this is by averaging the rewards actually received when the action was selected. In other words, if by the t-th play action a has been chosen K_a times prior to t, yielding rewards R_1, R_2, ..., R_{K_a}, then its value is estimated to be

Q_t(a) = (R_1 + R_2 + ... + R_{K_a}) / K_a        (2.1)

If K_a = 0, then we define Q_t(a) instead as some default value, such as Q_1(a) = 0. As K_a → ∞, by the law of large numbers Q_t(a) converges to q*(a).

We call this the sample-average method for estimating action values, because each estimate is a simple average of the sample of relevant rewards. Of course this is just one way to estimate action values, and not necessarily the best one.

Nevertheless, for now let us stay with this simple estimation method and turn to the question of how the estimates might be used to select actions.

The simplest action selection rule is to select the action (or one of the actions) with the highest estimated action value, that is, to select on play t one of the greedy actions a*, for which Q_t(a*) = max_a Q_t(a). This method always exploits current knowledge to maximize immediate reward; it spends no time at all sampling apparently inferior actions to see if they might really be better.

A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, instead select an action at random, uniformly, independently of the action-value estimates. We call methods using this near-greedy action selection rule ε-greedy methods.

An advantage of these methods is that, in the limit as the number of plays increases, every action will be sampled an infinite number of times, guaranteeing that K_a → ∞ for all a, and thus ensuring that all the Q_t(a) converge to q*(a). This of course implies that the probability of selecting the optimal action converges to greater than 1 − ε, that is, to near certainty. These are just asymptotic guarantees, however, and say little about the practical effectiveness of the methods.

To roughly assess the relative effectiveness of the greedy and ε-greedy methods, we compared them numerically on a suite of test problems. This is a set of 2000 randomly generated n-armed bandit tasks with n = 10. For each action a, the rewards were selected from a normal (Gaussian) probability distribution with mean q*(a) and variance 1.

The 2000 n-armed bandit tasks were generated by reselecting the q*(a) 2000 times, each according to a normal distribution with mean 0 and variance 1. Averaging over tasks, we can plot the performance and behavior of various methods as they improve with experience over 1000 plays, as in Figure 2.1. We call this suite of test tasks the 10-armed testbed.

Figure 2.1: Average performance of ε-greedy action-value methods on the 10-armed testbed. These data are averages over 2000 tasks. All methods used sample averages as their action-value estimates.
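The following is a minimal Python sketch (not from the original notes) of how such a 10-armed testbed might be generated and sampled; it assumes NumPy, and the names q_star, n_tasks, n_arms and pull are illustrative.

# Minimal sketch of the 10-armed testbed described above. Each task draws its
# true action values q*(a) from N(0, 1); pulling arm a in a task then returns
# a reward drawn from N(q*(a), 1).
import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_arms = 2000, 10

# True action values for every task: shape (n_tasks, n_arms)
q_star = rng.normal(loc=0.0, scale=1.0, size=(n_tasks, n_arms))

def pull(task, arm):
    # Return a noisy reward for pulling `arm` in bandit task `task`.
    return rng.normal(loc=q_star[task, arm], scale=1.0)

# Example: one pull of arm 3 in task 0
print(pull(0, 3))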
Action Value Function

No Exploration (Greedy Approach)

A naïve approach could be to calculate q, the action-value function, for all arms at each timestep and, from that point onwards, select the action which gives the maximum q. The action value of each action is estimated at each timestep as the average of the rewards received so far from that action:

Q_t(a) = (sum of rewards received when a was chosen prior to t) / (number of times a was chosen prior to t)

At each timestep we then choose the action that maximises the above expression:

A_t = argmax_a Q_t(a)

However, evaluating this expression at each time t would require calculations over the whole history of rewards. We can avoid this by keeping a running sum, so that at each step the q-value of the chosen action can be updated incrementally from the latest reward:

Q_{n+1} = Q_n + (1/n) (R_n − Q_n)

The problem with this approach is that it only exploits: it always picks the same action, without exploring other actions that might return a better reward. Some exploration is necessary to actually find an optimal arm; otherwise we might end up pulling a suboptimal arm forever. A sketch of this purely greedy strategy is given below.
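The following minimal sketch (illustrative, not from the original notes) implements the purely greedy strategy with the incremental update above; it assumes the hypothetical pull() helper and n_arms from the earlier testbed sketch.

# Purely greedy strategy with incrementally updated sample averages.
import numpy as np

def run_greedy(task, n_plays=1000):
    Q = np.zeros(n_arms)   # current action-value estimates
    N = np.zeros(n_arms)   # number of times each arm has been pulled
    rewards = []
    for _ in range(n_plays):
        a = int(np.argmax(Q))          # always exploit the current best estimate
        r = pull(task, a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental update: Q_{n+1} = Q_n + (1/n)(R_n - Q_n)
        rewards.append(r)
    return np.mean(rewards)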

Epsilon Greedy Approach

One potential solution is to add a small amount of randomness to the action selection, so that we also explore new actions and do not miss out on a better choice of arm. With probability epsilon we choose a random action (exploration), and with probability 1 − epsilon we choose the action with the maximum Q_t(a) (exploitation).

With probability 1 − epsilon, we choose the action with the maximum estimated value (argmax_a Q_t(a)).

With probability epsilon, we randomly choose an action from the set of all actions A.

For example, if we have a problem with two actions, A and B, the epsilon-greedy algorithm works as sketched below.
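The following is a minimal illustrative Python sketch of epsilon-greedy action selection (assuming NumPy and the hypothetical pull()/n_arms helpers from the earlier testbed sketch; it is not code from the original notes).

# Epsilon-greedy action selection with incrementally updated sample averages.
import numpy as np

def run_epsilon_greedy(task, n_plays=1000, epsilon=0.1):
    Q = np.zeros(n_arms)
    N = np.zeros(n_arms)
    rewards = []
    for _ in range(n_plays):
        if np.random.random() < epsilon:
            a = np.random.randint(n_arms)   # explore: uniform random action
        else:
            a = int(np.argmax(Q))           # exploit: current greedy action
        r = pull(task, a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
        rewards.append(r)
    return np.mean(rewards)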

This is much better than the greedy approach, as we now have an element of exploration. However, the exploration is blind: even if two actions have only a very minute difference between their q-values, the algorithm still exploits only the one with the marginally higher estimate and explores the remaining actions uniformly at random, without regard to how promising each of them appears.

MARKOV PROPERTY AND ITS PROCESSES

THE MARKOV PROPERTY

Transition: Moving from one state to another is called a transition.

Transition probability: The probability that the agent will move from one state to another is called the transition probability.

The Markov Property states that:
“The future is independent of the past, given the present.”
Mathematically, we can express this statement as:

P[S[t+1] | S[t]] = P[S[t+1] | S[1], S[2], ..., S[t]]

S[t] denotes the current state of the agent and S[t+1] denotes the next state. What this equation means is that the transition from state S[t] to S[t+1] is entirely independent of the past. The right-hand side of the equation equals the left-hand side whenever the system has the Markov Property; intuitively, the current state already captures all the relevant information from the past states.

State Transition Probability:

Now that we know about transition probabilities, we can define the state transition probability as follows. For a Markov state s and any successor state s', the state transition probability is given by:

P_ss' = P[S[t+1] = s' | S[t] = s]

We can collect the state transition probabilities into a state transition probability matrix P, whose entry in row s and column s' is P_ss'. Each row in the matrix represents the probabilities of moving from a given starting state to every possible successor state, and the sum of each row is equal to 1.
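For concreteness, here is a small illustrative sketch (not from the original notes) of such a matrix for the three-state Sleep/Run/Ice-cream chain used in the example below; only the Sleep row matches the 0.2/0.6/0.2 probabilities quoted there, the other rows are assumed values.

# Illustrative state transition probability matrix for a 3-state chain
# (states: Sleep, Run, Ice-cream).
import numpy as np

states = ["Sleep", "Run", "Ice-cream"]
P = np.array([
    [0.2, 0.6, 0.2],   # from Sleep:     0.2 Sleep, 0.6 Run, 0.2 Ice-cream
    [0.1, 0.6, 0.3],   # from Run:       assumed values
    [0.3, 0.3, 0.4],   # from Ice-cream: assumed values
])

# Every row of a valid transition matrix sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)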

Markov Process or Markov Chains

A Markov process is a memoryless random process, i.e. a sequence of random states S[1], S[2], ..., S[n] with the Markov Property. So it is basically a sequence of states with the Markov Property. It can be defined using a set of states S and a transition probability matrix P; the dynamics of the environment are fully defined by these two components.

But what does “random process” mean?

To answer this question, let’s look at an example: the three-state chain above, whose edges denote transition probabilities. Let’s take some samples from this chain. Suppose that we are sleeping; according to the probability distribution there is a 0.6 chance that we will Run, a 0.2 chance that we will Sleep more, and a 0.2 chance that we will eat Ice-cream. Similarly, we can think of other sequences that we can sample from this chain. Some samples from the chain:

• Sleep — Run — Ice-cream — Sleep

• Sleep — Ice-cream — Ice-cream — Run

In the above two sequences we get a different random sequence of states (e.g. Sleep, Ice-cream, Sleep) every time we run the chain. This should make it clear why a Markov process is called a random sequence of states.
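A minimal sketch of sampling trajectories from this chain (illustrative; it reuses the states list and matrix P from the sketch above):

# Sample a trajectory from the Markov chain defined by `states` and `P`.
import numpy as np

def sample_chain(start="Sleep", length=4, seed=None):
    rng = np.random.default_rng(seed)
    idx = states.index(start)
    trajectory = [states[idx]]
    for _ in range(length - 1):
        idx = rng.choice(len(states), p=P[idx])   # next state ~ current row of P
        trajectory.append(states[idx])
    return trajectory

print(" — ".join(sample_chain()))   # e.g. Sleep — Run — Ice-cream — Sleep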

Markov Decision Process:

A Markov Decision Process (MDP) is a Markov Reward Process with decisions. Everything is the same as in an MRP, but now we have an actual agent that makes decisions or takes actions.

An MDP is a tuple (S, A, P, R, 𝛾), where:

• S is the set of states,

• A is the set of actions the agent can choose to take,

• P is the transition probability matrix,

• R is the reward accumulated by the actions of the agent,

• 𝛾 is the discount factor.

A small representation of this tuple in code is sketched below.
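The following is a minimal illustrative container for the (S, A, P, R, 𝛾) tuple (the field names and array shapes are assumptions, not a standard API):

# Minimal container for an MDP specification (illustrative only).
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    states: list          # S: the set of states
    actions: list         # A: the set of actions
    P: np.ndarray         # P[a, s, s']: probability of moving from s to s' under action a
    R: np.ndarray         # R[s, a]: expected immediate reward for taking a in s
    gamma: float          # discount factor in [0, 1)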

OPTIMAL VALUE FUNCTIONS

Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. For finite MDPs, we can precisely define an optimal policy in the following way.

Value functions define a partial ordering over policies. A policy π is defined to be better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states. In other words, π ≥ π' if and only if v_π(s) ≥ v_π'(s) for all states s.

There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by π*. They share the same state-value function, called the optimal state-value function, denoted v*, and defined as

v*(s) = max_π v_π(s)        for all states s.

Optimal policies also share the same optimal action-value function, denoted q*, and defined as

q*(s, a) = max_π q_π(s, a)        for all states s and actions a.

For the state-action pair (s, a), this function gives the expected return for taking action a in state s and thereafter following an optimal policy. Thus, we can write q* in terms of v* as follows:

q*(s, a) = E[ R_{t+1} + 𝛾 v*(S_{t+1}) | S_t = s, A_t = a ]
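As a concrete companion to these definitions, here is a minimal value-iteration sketch (illustrative, not from the original notes) that computes v* and q* for the hypothetical MDP container defined earlier:

# Value iteration: compute v*(s) and q*(s, a) for an MDP instance as above.
import numpy as np

def value_iteration(mdp, tol=1e-8):
    n_s, n_a = len(mdp.states), len(mdp.actions)
    v = np.zeros(n_s)
    while True:
        # q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * v(s')
        q = mdp.R + mdp.gamma * np.einsum("ast,t->sa", mdp.P, v)
        v_new = q.max(axis=1)          # v*(s) = max_a q*(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q
        v = v_new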

TRACKING NON-STATIONARY PROBLEMS WITH A CONSTANT STEP-SIZE APPROACH

The various types of RL-relevant problems you can encounter have been fairly rigorously characterised over the years. Here we focus on non-stationary problems: problems whose true underlying values change over time. This introduces an interesting dynamic between optimising the total reward we receive and identifying the best actions to take (since they change over time).

The sample-average approach converges to a single true value over time. For a non-stationary problem the true value varies over time, so intuitively an approach that converges to a single value is not going to work well.

Because the goal changes over time, the most useful data comes from the most recent rewards. Older rewards are much less useful, because the changes to the action values accumulate over time.

Starting with the incremental update formula for the sample-average algorithm, we fix the step-size to be a constant and then show that this achieves our goal, with Q_{n+1} depending on a weighted average of the rewards.

Step one: consider the incremental sample-average formula, which incrementally calculates an estimate for the value of an action:

Q_{n+1} = Q_n + (1/n) (R_n − Q_n)

Replace (1/n) with a constant step-size 𝛼:

Q_{n+1} = Q_n + 𝛼 (R_n − Q_n)

Step two: rearrange the formula by collecting together the Q_n terms:

Q_{n+1} = 𝛼 R_n + (1 − 𝛼) Q_n

Step three: take this equation and reformulate it in terms of n rather than n+1 (just subtract 1 from every n):

Q_n = 𝛼 R_{n−1} + (1 − 𝛼) Q_{n−1}

Step four: substitute this into the final equation from step two and separate it out into three terms:

Q_{n+1} = 𝛼 R_n + (1 − 𝛼) 𝛼 R_{n−1} + (1 − 𝛼)^2 Q_{n−1}

At the end of this equation there is a Q_{n−1} term. Once again we can get an expression for Q_{n−1} by subtracting 1 from n:

Q_{n−1} = 𝛼 R_{n−2} + (1 − 𝛼) Q_{n−2}

Substituting this in:

Q_{n+1} = 𝛼 R_n + (1 − 𝛼) 𝛼 R_{n−1} + (1 − 𝛼)^2 𝛼 R_{n−2} + (1 − 𝛼)^3 Q_{n−2}

This expansion could be continued indefinitely; we stop here because the pattern is becoming clear.

Step five: extend what we have discovered so far to the general case using summation notation:

Q_{n+1} = (1 − 𝛼)^n Q_1 + Σ_{i=1..n} 𝛼 (1 − 𝛼)^{n−i} R_i

What does this mean?

We wanted an algorithm that puts more emphasis on recent data. We can see that this is achieved here by the weighting of the rewards in the sum: the contribution of each reward is greatest when it is new and then exponentially decays away.

This approach is sometimes referred to as the exponential recency-weighted average (ERWA).
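A minimal sketch of the constant step-size update applied to a drifting (non-stationary) bandit (illustrative; assumes NumPy, and all names are hypothetical):

# Constant step-size (exponential recency-weighted average) update.
import numpy as np

def update_constant_step(Q, a, reward, alpha=0.1):
    # Q_{n+1} = Q_n + alpha * (R_n - Q_n), applied to arm `a` only.
    Q[a] += alpha * (reward - Q[a])
    return Q

# Example: the true values drift over time, but the estimates keep tracking
# them because old rewards are exponentially down-weighted.
rng = np.random.default_rng(1)
Q = np.zeros(3)
true_values = np.zeros(3)
for t in range(1000):
    true_values += rng.normal(0.0, 0.01, size=3)   # random-walk drift
    a = int(rng.integers(3))                       # pick an arm (uniformly, for the demo)
    r = rng.normal(true_values[a], 1.0)
    Q = update_constant_step(Q, a, r)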

UPPER-CONFIDENCE-BOUND ACTION SELECTION

Upper-Confidence-Bound (UCB) action selection uses uncertainty in the action-value estimates to balance exploration and exploitation. Since there is inherent uncertainty in the accuracy of the action-value estimates when we use a sampled set of rewards, UCB uses that uncertainty in the estimates to drive exploration.

Here Q_t(a) represents the current estimate for action a at time t. We select the action that has the highest estimated action value plus the upper-confidence-bound exploration term:

A_t = argmax_a [ Q_t(a) + c · sqrt( ln t / N_t(a) ) ]

where N_t(a) is the number of times action a has been selected prior to time t and c > 0 controls the degree of exploration.
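A minimal illustrative sketch of UCB action selection (assuming NumPy and the hypothetical pull()/n_arms helpers from the earlier testbed sketch):

# Upper-Confidence-Bound action selection with sample-average estimates.
import numpy as np

def run_ucb(task, n_plays=1000, c=2.0):
    Q = np.zeros(n_arms)
    N = np.zeros(n_arms)
    for t in range(1, n_plays + 1):
        if np.any(N == 0):
            a = int(np.argmin(N))      # play each arm once before using the bound
        else:
            ucb = Q + c * np.sqrt(np.log(t) / N)
            a = int(np.argmax(ucb))    # highest estimate plus exploration bonus
        r = pull(task, a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # sample-average update
    return Q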
GRADIENT BANDITS
Another way to balance exploration and exploitation is the gradient bandit algorithm.

So far we have considered methods that estimate action values and then use those estimates to select actions. This is often a good approach, but it is not the only one possible. We can also consider learning a numerical preference H_t(a) for each action a. The larger the preference, the more often that action is taken. Note that the preference has no interpretation in terms of reward; only the relative preference of one action over another is important.

The action probabilities are determined according to a soft-max distribution (i.e. the probabilities of taking the actions all sum up to 1):

π_t(a) = e^{H_t(a)} / Σ_b e^{H_t(b)}

where the new notation π_t(a) denotes the probability of taking action a at time t.

The preferences are then updated by stochastic gradient ascent. We take the current preference of the selected action, H_t(A_t), and update it based on the reward received (R_t) minus the average reward so far (R̄_t), weighted by one minus the probability of taking that action, π_t(A_t); the preferences of all other actions are moved in the opposite direction:

H_{t+1}(A_t) = H_t(A_t) + 𝛼 (R_t − R̄_t) (1 − π_t(A_t))
H_{t+1}(a) = H_t(a) − 𝛼 (R_t − R̄_t) π_t(a)        for all a ≠ A_t
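A minimal illustrative sketch of the gradient bandit update (assuming NumPy and the hypothetical pull()/n_arms helpers from the earlier testbed sketch):

# Gradient bandit: soft-max over preferences H, updated by stochastic
# gradient ascent on the expected reward, with a running average baseline.
import numpy as np

def run_gradient_bandit(task, n_plays=1000, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    H = np.zeros(n_arms)                # action preferences
    baseline = 0.0                      # running average reward (R_bar)
    for t in range(1, n_plays + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                  # soft-max action probabilities
        a = int(rng.choice(n_arms, p=pi))
        r = pull(task, a)
        baseline += (r - baseline) / t  # incremental average reward
        one_hot = np.zeros(n_arms)
        one_hot[a] = 1.0
        # Selected action's preference moves up, all others move down.
        H += alpha * (r - baseline) * (one_hot - pi)
    return H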

ASSOCIATIVE SEARCH (CONTEXTUAL BANDITS)

The contextual bandit extends the bandit model by making the decision conditional on the state (context) of the environment. For example, you can use a contextual bandit to select which news article to show first on the main page of your website in order to optimise the click-through rate.

The context is information about the user: where they come from, previously visited pages of the site, device information, geolocation, etc. An action is a choice of which news article to display. An outcome is whether the user clicked on the link or not, and the reward is binary: 0 if there is no click, 1 if there is a click.
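A minimal illustrative sketch of a contextual epsilon-greedy bandit for this news-article example (the context buckets, article list and epsilon value are assumptions):

# Contextual epsilon-greedy bandit: one set of value estimates per context.
import numpy as np

contexts = ["mobile", "desktop"]           # assumed context buckets
articles = ["sports", "politics", "tech"]  # assumed actions

Q = {c: np.zeros(len(articles)) for c in contexts}
N = {c: np.zeros(len(articles)) for c in contexts}
rng = np.random.default_rng()

def choose_article(context, epsilon=0.1):
    if rng.random() < epsilon:
        return int(rng.integers(len(articles)))   # explore
    return int(np.argmax(Q[context]))             # exploit within this context

def update(context, article_idx, clicked):
    # `clicked` is the binary reward: 1 for a click, 0 otherwise.
    N[context][article_idx] += 1
    Q[context][article_idx] += (clicked - Q[context][article_idx]) / N[context][article_idx]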

REFERENCE LINK FOR LEARNING THE FORMULAE

https://lcalem.github.io/blog/2018/09/22/sutton-chap02-bandits
