
UNIT 1: REINFORCEMENT LEARNING

TOPIC 2: MULTI-ARMED BANDITS, AN N-ARMED BANDIT PROBLEM, AND TRACKING A NONSTATIONARY PROBLEM IN REINFORCEMENT LEARNING

1. MULTI-ARMED BANDITS:


Multi-Armed Bandit Problem

In Reinforcement Learning, we use the Multi-Armed Bandit Problem to formalize the notion of decision-making under uncertainty using k-armed bandits. A decision-maker or agent in a Multi-Armed Bandit Problem chooses between k different actions and receives a reward based on the action it chooses. The bandit problem is used to describe fundamental concepts in reinforcement learning, such as rewards, timesteps, and values.

Imagine a slot machine, also known as a bandit, with two levers. We assume that each lever has a separate distribution of rewards and that there is at least one lever that generates the maximum reward. The probability distribution of the reward corresponding to each lever is different and is unknown to the gambler (decision-maker). Hence, the goal is to identify which lever to pull to get the maximum reward after a given set of trials.
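To make this concrete, here is a minimal sketch of a k-armed bandit environment in Python. The class name BanditEnv, the Gaussian reward distributions, and the specific parameter values are illustrative assumptions, not part of the problem statement above.

import numpy as np

class BanditEnv:
    """A k-armed bandit: each arm pays a noisy reward around a hidden mean."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # Hidden true values q*(a), one per arm (unknown to the agent).
        self.q_true = self.rng.normal(0.0, 1.0, size=k)
        self.k = k

    def pull(self, action):
        # Reward is drawn around the chosen arm's true value with unit-variance noise.
        return self.rng.normal(self.q_true[action], 1.0)

# Example: pull arm 3 once and observe a reward.
env = BanditEnv(k=10)
print(env.pull(3))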

What is the Multi-Armed Bandit Problem?

● In the Multi-Armed Bandit problem there is a learner. The learner takes some action and the environment returns some reward value. The learner has to find a policy that leads to the maximum reward. To understand the multi-armed bandit problem, first consider a one-armed bandit problem.
● Suppose we have a slot machine which has one lever and a screen. The screen displays three or more wheels. When you pull the lever, the game is activated. This single lever represents the single-arm or one-armed bandit.
● Suppose an advertiser has to find out the click rate of each ad for the same product. The objective of the advertiser is to find the best advertisement. These are the steps of the Multi-Armed Bandit Problem:

1. We have m ads. The advertiser displays these ads to the user when the user visits the web page.
2. Each time a user visits the web page, that makes one round.
3. At each round, the advertiser chooses one ad to display to the user.
4. At each round n, ad j gets reward rj(n) ∈ {0, 1}: rj(n) = 1 if the user clicked on the ad, and 0 if the user did not click the ad.
5. The advertiser's goal is to maximize the total reward over all rounds (a simulation sketch of this loop follows the example below).

For Example:

● Imagine an online advertising trial where an advertiser wants to measure the click-through rate of three different ads for the same product. Whenever a user visits the website, the advertiser displays an ad at random. The advertiser then monitors whether the user clicks on the ad or not. After a while, the advertiser notices that one ad seems to be working better than the others. The advertiser must now decide between sticking with the best-performing ad or continuing with the randomized study.
● If the advertiser only displays one ad, then he can no longer collect data on the other two ads. Perhaps one of the other ads is actually better and only appears worse due to chance. On the other hand, if the other two ads really are worse, then continuing the study can adversely affect the overall click-through rate. This advertising trial exemplifies decision-making under uncertainty.
● In the above example, the role of the agent is played by the advertiser. The advertiser has to choose between three different actions: to display the first, second, or third ad. Each ad is an action, and choosing that ad yields some unknown reward. The profit the advertiser earns after the ad is shown is the reward that the advertiser receives.
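The round-by-round loop in steps 1-5 above can be sketched in Python as follows. The click-through rates, the ε-greedy selection rule, and all variable names are illustrative assumptions; the original steps only require that one ad is shown per round and a 0/1 click reward is recorded.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical true click-through rates for m = 3 ads (unknown to the advertiser).
true_ctr = [0.04, 0.05, 0.07]
m = len(true_ctr)

clicks = np.zeros(m)       # total clicks per ad
shows = np.zeros(m)        # times each ad was shown
epsilon = 0.1              # probability of showing a random ad (exploration)

for n in range(10_000):    # each web-page visit is one round
    if rng.random() < epsilon:
        j = rng.integers(m)                    # explore: pick a random ad
    else:
        estimates = clicks / np.maximum(shows, 1)
        j = int(np.argmax(estimates))          # exploit: best estimated ad so far
    reward = rng.random() < true_ctr[j]        # r_j(n) ∈ {0, 1}
    shows[j] += 1
    clicks[j] += reward

print("Estimated click-through rates:", clicks / np.maximum(shows, 1))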

An n-Armed Bandit Problem

○ This is the original form of the n-armed bandit problem, so named by analogy to a slot machine, or "one-armed bandit," except that it has n levers instead of one. Each action selection is like a play of one of the slot machine's levers, and the rewards are the payoffs for hitting the jackpot.
○ If you maintain estimates of the action values, then at any step there is at least one action whose estimated value is greatest; this is the greedy action, and selecting it exploits your current knowledge. If instead you select one of the nongreedy actions, then we say you are exploring, because this enables you to improve your estimate of the nongreedy action's value.
○ If you have many time steps ahead on which to make action selections, then it may be better to explore the nongreedy actions and discover which of them are better than the greedy action.
○ The need to balance exploration and exploitation is a distinctive challenge that arises in reinforcement learning; the simplicity of the n-armed bandit problem enables us to show this in a particularly clear form.

Let Nt(a) denote the number of times action a has been selected prior to step t, and Qt(a) the sample average of the rewards received on those selections. If Nt(a) = 0, then we define Qt(a) instead as some default value, such as Q1(a) = 0. As Nt(a) → ∞, by the law of large numbers, Qt(a) converges to q*(a).

The simplest action selection rule is to select the action (or one of the actions) with the highest estimated action value, that is, to select at step t one of the greedy actions, At*, for which Qt(At*) = max_a Qt(a). This greedy action selection method can be written as

At = argmax_a Qt(a).
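A minimal sketch of this greedy rule in Python, assuming Q holds the current estimates Qt(a); the random tie-breaking is a common convention and is an assumption here, not something stated above.

import numpy as np

def greedy_action(Q, rng=np.random.default_rng()):
    """Return an action with the highest estimated value, breaking ties at random."""
    best = np.flatnonzero(Q == Q.max())   # all greedy actions
    return int(rng.choice(best))          # A_t = argmax_a Q_t(a)

# Example: with these estimates the greedy choice is index 2.
Q = np.array([0.1, 0.4, 0.9, 0.4])
print(greedy_action(Q))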

Action-Values:

For the advertiser to decide which action is best, we must define the value of taking each action. We define these values with the action-value function, expressed in the language of probability. The value of selecting an action a, written q*(a), is defined as the expected reward Rt we receive when taking that action from the possible set of actions.

The goal of the agent is to maximize the expected reward by selecting the action that has the highest action-value.
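Written out explicitly (the conditional-expectation notation is added here for clarity and follows the standard bandit formulation):

q*(a) = E[Rt | At = a],

i.e., the expected reward Rt given that action a is the one selected at step t.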

Action-value Estimate:

Since the value of selecting an action, i.e. q*(a), is not known to the agent, we use the sample-average method to estimate it.
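A short sketch of the sample-average method in Python, using the standard incremental form Q ← Q + (1/N)(R − Q), which is equivalent to averaging all rewards observed for an action; the class and variable names are illustrative.

import numpy as np

class SampleAverageEstimator:
    """Maintains Q_t(a) as the running average of rewards observed for each action."""

    def __init__(self, k):
        self.Q = np.zeros(k)   # estimates, with default Q_1(a) = 0
        self.N = np.zeros(k)   # N_t(a): number of times each action has been taken

    def update(self, action, reward):
        self.N[action] += 1
        # Incremental sample average: Q <- Q + (1/N) * (R - Q)
        self.Q[action] += (reward - self.Q[action]) / self.N[action]

# Example: rewards of 1.0 and 0.0 for action 0 give Q_t(0) = 0.5.
est = SampleAverageEstimator(k=3)
est.update(0, 1.0)
est.update(0, 0.0)
print(est.Q[0])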
Exploration vs Exploitation:

● Greedy Action: When an agent chooses the action that currently has the largest estimated value. The agent exploits its current knowledge by choosing the greedy action.
● Non-Greedy Action: When the agent does not choose the action with the largest estimated value, it sacrifices immediate reward in the hope of gaining more information about the other actions.
● Exploration: Allows the agent to improve its knowledge about each action, hopefully leading to a long-term benefit.
● Exploitation: Allows the agent to choose the greedy action to try to get the most reward for a short-term benefit. Purely greedy action selection can lead to sub-optimal behaviour.

A dilemma arises between exploration and exploitation because an agent cannot both explore and exploit at the same time. Hence, we use the Upper Confidence Bound (UCB) algorithm to address the exploration-exploitation dilemma.
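A minimal sketch of UCB action selection in Python, using the common form At = argmax_a [Qt(a) + c·sqrt(ln t / Nt(a))]; the constant c and the handling of untried actions are illustrative choices, not specified in the text above.

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Select an action by Upper Confidence Bound: estimate plus an exploration bonus."""
    # Any action not yet tried gets priority (its bonus would be infinite).
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / N)   # uncertainty term, shrinks as N_t(a) grows
    return int(np.argmax(Q + bonus))     # A_t = argmax_a [Q_t(a) + bonus]

# Example: a rarely tried arm can be selected despite a lower estimate.
Q = np.array([0.5, 0.45]); N = np.array([100, 2])
print(ucb_action(Q, N, t=102))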

Tracking a Nonstationary Problem


The averaging methods discussed so far are appropriate for stationary problems. In many reinforcement learning problems, however, the reward probabilities change over time, and it then makes sense to weight recent rewards more heavily than long-past ones. One of the most popular ways of doing this is to use a constant step-size parameter. For example, the incremental update rule for updating an average Qk of the k − 1 past rewards is modified to be

Qk+1 = Qk + α[Rk − Qk],

where the step-size parameter α ∈ (0, 1] is constant. This results in Qk+1 being a weighted average of past rewards and of the initial estimate Q1:

Qk+1 = Qk + α[Rk − Qk]
     = αRk + (1 − α)Qk
     = αRk + (1 − α)[αRk−1 + (1 − α)Qk−1]
     = αRk + (1 − α)αRk−1 + (1 − α)^2 Qk−1
     = αRk + (1 − α)αRk−1 + (1 − α)^2 αRk−2 + · · · + (1 − α)^(k−1) αR1 + (1 − α)^k Q1
     = (1 − α)^k Q1 + Σ (from i = 1 to k) α(1 − α)^(k−i) Ri.
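As a quick numerical check of the derivation above, the following Python sketch compares the iterative constant-α update with the closed-form weighted average; the reward sequence and the value of α are arbitrary illustrative choices.

import numpy as np

alpha, Q1 = 0.1, 0.0
rewards = np.array([1.0, 0.0, 2.0, 1.5, 0.5])   # R_1 ... R_k (arbitrary)
k = len(rewards)

# Iterative form: Q_{k+1} = Q_k + alpha * (R_k - Q_k)
Q = Q1
for R in rewards:
    Q = Q + alpha * (R - Q)

# Closed form: (1 - alpha)^k * Q_1 + sum over i of alpha * (1 - alpha)^(k - i) * R_i
i = np.arange(1, k + 1)
closed = (1 - alpha) ** k * Q1 + np.sum(alpha * (1 - alpha) ** (k - i) * rewards)

print(Q, closed)   # both give the same exponential recency-weighted average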

A well-known result in stochastic approximation theory gives us the conditions required to assure convergence with probability 1:

Σ (from k = 1 to ∞) αk(a) = ∞   and   Σ (from k = 1 to ∞) αk(a)² < ∞.

The first condition is required to guarantee that the steps are large
enough to eventually overcome any initial conditions or random
fluctuations. The second condition guarantees that eventually the
steps become small enough to assure convergence.

Note that both convergence conditions are met for the sample-average case, αk(a) = 1/k, but not for the case of a constant step-size parameter, αk(a) = α. In the latter case, the second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards. As noted above, this is actually desirable in a nonstationary environment.
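As a quick check of these conditions (using standard series facts): for the sample-average case αk(a) = 1/k, the first sum Σ 1/k is the harmonic series and diverges to ∞, while Σ 1/k² converges (to π²/6), so both conditions hold. For a constant step size αk(a) = α, the first sum Σ α = ∞ holds, but Σ α² also diverges, so the second condition fails and the estimates keep tracking the most recent rewards.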
