Unit:1 Reinforcement Learning
In the multi-armed bandit problem, an agent repeatedly chooses between k different actions and receives a reward based on the action it chooses.
The classic illustration is a slot machine, also known as a bandit, with two levers. We assume that each lever has a separate distribution of rewards and that there is at least one lever that generates the maximum reward. The probability distribution for the reward from each lever is unknown to the gambler (the decision maker). Hence, the goal here is to identify which lever to pull to get the maximum reward after a given set of trials.
● In the Multi-Armed Bandit problem, the learner takes some action and the environment returns some reward value. The learner has to find a way of choosing actions that collects the largest total reward over many rounds.
● Suppose we have a slot machine which has one lever and a screen, and the screen displays three or more wheels. When you pull the lever, the game is activated. This single lever represents the single-arm or one-armed bandit.
● Suppose an advertiser has to find out the click rate of each ad for the same product. The objective of the advertiser is to find the best advertisement. So, the problem can be set up as follows:
1. The advertiser has several ads for the product, and one of them is displayed each time a user visits the web page.
2. Each time a user visits the web page, that makes one round.
3. In each round, the advertiser chooses one ad to display to that user.
4. The displayed ad earns a reward of 1 if the user clicks on it and 0 otherwise.
5. The advertiser’s goal is to maximize the total reward from all rounds.
For example:
● An advertiser wants to measure the click-through rate of three different ads for the same product. In each round, one of the ads is shown to the visiting user and the advertiser records whether the user clicks the ad or not. After a while, the advertiser notices that one ad seems to be working better than the others. The advertiser must now decide whether to show only that ad or to continue the randomized study.
● If the advertiser only displays one ad, then he can no longer collect data on the other two ads. Perhaps one of the other ads is actually better and only appears worse due to chance. If, on the other hand, the other two ads really are worse, then continuing the randomized study affects the overall click-through rate adversely. This is the exploration-exploitation trade-off of the multi-armed bandit: each ad that is displayed yields some unknown reward, and the profit of the advertiser after all rounds is the sum of those rewards. In the slot-machine analogy, the ads are the levers, and the rewards are the payoffs for hitting the jackpot. A minimal simulation of this setting is sketched below.
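As a concrete illustration, here is a minimal sketch of the randomized study above. The three click probabilities, the number of rounds, and the use of numpy are assumptions made only for this example; the advertiser would not know the true click rates.

```python
import numpy as np

rng = np.random.default_rng(0)

true_ctr = [0.05, 0.04, 0.08]   # hypothetical true click-through rates (unknown in practice)
n_rounds = 10_000

clicks = np.zeros(3)            # total reward (clicks) collected per ad
shows = np.zeros(3)             # how many times each ad was displayed

for _ in range(n_rounds):
    ad = rng.integers(3)                   # randomized study: show an ad chosen uniformly at random
    reward = rng.random() < true_ctr[ad]   # 1 if the user clicks, 0 otherwise
    shows[ad] += 1
    clicks[ad] += reward

print("estimated click-through rate per ad:", clicks / shows)
print("total reward:", int(clicks.sum()))
```

After enough rounds the estimated rates single out the best ad, but every round spent on a worse ad costs reward; that tension is the dilemma discussed above.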
The simplest action selection rule is to select the action (or one of the actions) with the highest estimated action value, that is, to select at step t one of the greedy actions, $A_t^*$, for which

$$Q_t(A_t^*) = \max_a Q_t(a).$$
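In code, greedy selection is just an argmax over the current estimates; the numbers and the use of numpy below are assumptions for illustration.

```python
import numpy as np

Q = np.array([0.25, 0.60, 0.10])   # current action-value estimates Q_t(a) (made-up values)
greedy_action = int(np.argmax(Q))  # A_t*: the action with the highest estimate (here, 1)
```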
Action-Values:
For the advertiser to decide which action is best, we must define the value of taking each action. We define these values with the action-value function, expressed in the language of probability. The value of selecting an action a, written q*(a), is defined as the expected reward Rt we receive when that action is taken from the possible set of actions. The goal of the agent is to maximize the expected reward by selecting the action that has the highest action-value.
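Written out in the standard k-armed-bandit notation, consistent with the symbols above:

$$q_*(a) \doteq \mathbb{E}\left[\, R_t \mid A_t = a \,\right].$$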
Action-value Estimate:
Since the value of selecting an action, i.e. q*(a), is not known to the agent, we use the sample-average method to estimate it.
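The sample-average estimate, stated here for completeness, averages the rewards actually received on the steps when action a was chosen:

$$Q_t(a) = \frac{\text{sum of rewards received when } a \text{ was taken prior to } t}{\text{number of times } a \text{ was taken prior to } t}.$$

A minimal incremental implementation is sketched below; the variable names and the use of numpy are assumptions, and each new reward moves the estimate toward itself by a shrinking step of size 1/N.

```python
import numpy as np

k = 3                        # number of actions (e.g. three ads)
Q = np.zeros(k)              # sample-average estimates Q_t(a)
N = np.zeros(k, dtype=int)   # how many times each action has been taken

def record(a, reward):
    """Incremental sample-average update: Q_new = Q_old + (1/N) * (reward - Q_old)."""
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]
```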
Exploration vs Exploitation:
● Greedy Action: When an agent chooses an action that currently has the
largest estimated value. The agent exploits its current knowledge by
choosing the greedy action.
● Non-Greedy Action: When the agent does not choose the action with the largest estimated value, sacrificing immediate reward in the hope of gaining more information about the other actions.
● Exploration: It allows the agent to improve its knowledge about each action, hopefully leading to a long-term benefit.
● Exploitation: It allows the agent to choose the greedy action to try to get the most reward for a short-term benefit. Purely greedy action selection can lead to sub-optimal behaviour.
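One common way to mix the two, not named explicitly in these notes but a standard baseline, is ε-greedy selection: exploit the greedy action with probability 1 − ε and explore a random action with probability ε. A minimal sketch, with ε = 0.1 chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, epsilon=0.1):
    """With prob. epsilon take a random (exploratory) action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # non-greedy action: explore
    return int(np.argmax(Q))               # greedy action: exploit current knowledge
```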
A dilemma occurs between exploration and exploitation because an agent cannot choose to both explore and exploit at the same time. Hence, we use the Upper Confidence Bound (UCB) algorithm to address the exploration-exploitation dilemma; its action-selection rule is sketched below.
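A minimal sketch of UCB action selection, assuming the common UCB1 form Q_t(a) + c·sqrt(ln t / N_t(a)); the exploration constant c, the variable names, and the use of numpy are assumptions.

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Select the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).

    Q: action-value estimates, N: how often each action has been taken,
    t: current round (starting at 1). Untried actions are selected first.
    """
    if np.any(N == 0):
        return int(np.argmin(N))            # ensure every action is tried at least once
    bonus = c * np.sqrt(np.log(t) / N)      # uncertainty bonus shrinks as N_t(a) grows
    return int(np.argmax(Q + bonus))
```

The bonus term is large for actions tried only a few times, so UCB keeps exploring them; as the counts grow, the rule behaves more and more greedily.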
Constant Step-Size Parameter:
When the reward probabilities change over time (a nonstationary problem), it makes sense to weight recent rewards more heavily than long-past ones. One of the most popular ways of doing this is to use a constant step-size parameter. For example, the incremental update rule for the sample average, $Q_{k+1} = Q_k + \frac{1}{k}\big[R_k - Q_k\big]$, is modified to

$$Q_{k+1} = Q_k + \alpha \big[R_k - Q_k\big],$$

where the step-size parameter $\alpha \in (0, 1]$ is constant. This results in $Q_{k+1}$ being a weighted average of past rewards and the initial estimate $Q_1$:

$$
\begin{aligned}
Q_{k+1} &= Q_k + \alpha \big[R_k - Q_k\big] \\
        &= \alpha R_k + (1 - \alpha) Q_k \\
        &= \alpha R_k + (1 - \alpha)\big[\alpha R_{k-1} + (1 - \alpha) Q_{k-1}\big] \\
        &\;\,\vdots \\
        &= (1 - \alpha)^k Q_1 + \sum_{i=1}^{k} \alpha (1 - \alpha)^{k-i} R_i .
\end{aligned}
$$
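A small numerical check of this derivation (the reward values, α, and Q1 below are made-up): the incremental constant-α update and the closed-form weighted average give the same result.

```python
import math

alpha, Q1 = 0.1, 0.0
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]   # hypothetical reward sequence R_1, ..., R_k

# Incremental form: Q_{k+1} = Q_k + alpha * (R_k - Q_k)
Q = Q1
for R in rewards:
    Q += alpha * (R - Q)

# Closed form: (1 - alpha)^k * Q_1 + sum_i alpha * (1 - alpha)^(k - i) * R_i
k = len(rewards)
closed = (1 - alpha) ** k * Q1 + sum(
    alpha * (1 - alpha) ** (k - i) * R for i, R in enumerate(rewards, start=1)
)

print(Q, closed, math.isclose(Q, closed))   # both values match: True
```

Because the weight α(1 − α)^(k−i) decays with the age of the reward, this estimate is often called an exponential recency-weighted average.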
Let $\alpha_k(a)$ denote the step-size parameter used after the k-th selection of action a. A well-known result in stochastic approximation theory gives the conditions on $\alpha_k(a)$ required to assure convergence with probability 1:

$$\sum_{k=1}^{\infty} \alpha_k(a) = \infty \qquad \text{and} \qquad \sum_{k=1}^{\infty} \alpha_k^2(a) < \infty .$$
The first condition is required to guarantee that the steps are large
enough to eventually overcome any initial conditions or random
fluctuations. The second condition guarantees that eventually the
steps become small enough to assure convergence.
Note that both convergence conditions are met for the sample-average case, $\alpha_k(a) = \frac{1}{k}$, but not for the case of a constant step-size parameter, $\alpha_k(a) = \alpha$. In the latter case, the second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards.
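A quick check of these two claims using standard series facts (not spelled out in the notes):

$$\sum_{k=1}^{\infty} \frac{1}{k} = \infty \;\;\text{(harmonic series)}, \qquad \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty ,$$

whereas for a constant step size $\alpha > 0$,

$$\sum_{k=1}^{\infty} \alpha = \infty , \qquad \sum_{k=1}^{\infty} \alpha^2 = \infty ,$$

so the second condition fails.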