Reinforcement Learning - Chapter 2
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
Unit II - Notes
PREPARED BY: G.SIVASATHIYA, AP/AI&DS
APPROVED BY: HOD/AI&DS
UNIT – 2
MULTI-ARM BANDITS AND MARKOV DECISION PROCESS
Use Cases
Bandit algorithms are used in many research projects and industry applications. Some of their use cases are listed in this section.
Clinical Trials
In a clinical trial, patients must be allocated among several competing treatments whose effectiveness is not yet known. Choosing which treatment to give the next patient, while balancing the need to learn about each treatment against giving more patients the treatment that currently looks best, can be formulated as a MABP.
Network Routing
Routing is the process of selecting a path for traffic in a network,
such as telephone networks or computer networks (internet). Allocation
of channels to the right users, such that the overall throughput is
maximised, can be formulated as a MABP.
Online Advertising
The goal of an advertising campaign is to maximise revenue from
displaying ads. The advertiser makes revenue every time an offer is
clicked by a web user. Similar to MABP, there is a trade-off between
exploration, where the goal is to collect information on an ad’s
performance using click-through rates, and exploitation, where we stick
with the ad that has performed the best so far.
Game Designing
Building a hit game is challenging. MABP can be used to test experimental changes to game play or the interface and to exploit the changes that produce positive experiences for players.
EXPLORATION AND EXPLOITATION IN RL
Exploration
Exploration is more of a long-term concept: it allows the agent to improve its knowledge about each action, which can lead to long-term benefit.
Exploitation
Exploitation uses the agent’s current estimated values and chooses the greedy action to get the most reward. However, the agent is being greedy with respect to the estimated values, not the actual values, so chances are it might not actually get the most reward.
Let’s say you and your friend are digging in the hope of finding a diamond. Your friend gets lucky, finds a diamond before you do, and walks off happily.
Seeing this, you get a bit greedy and think that you might also get lucky, so you start digging at the same spot as your friend. This action is called the greedy action, and the policy is called the greedy policy.
However, when your friend found the diamond, the only knowledge you gained was the depth at which that diamond was buried. You do not know what lies beyond that depth. In reality, the diamond may be where you were digging in the beginning, it may be where your friend was digging, or it may be in a completely different place.
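Here is a minimal sketch in Python connecting the analogy to action selection; the digging spots and their estimated values are invented for illustration. The greedy action exploits the current estimates, while an exploratory action is chosen at random to keep gathering knowledge.

import random

# Invented value estimates for three digging spots (illustration only)
Q = {"your_spot": 0.2, "friends_spot": 0.9, "new_spot": 0.5}

def greedy_action(Q):
    """Exploitation: choose the action with the highest *estimated* value."""
    return max(Q, key=Q.get)

def exploratory_action(Q):
    """Exploration: choose any action uniformly at random."""
    return random.choice(list(Q))

print(greedy_action(Q))       # 'friends_spot' -- greedy on the current estimates
print(exploratory_action(Q))  # any of the three spots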
Recall that the true value of an action is the mean reward received when that action is selected. One natural way to estimate this is by averaging the rewards actually received when the action was selected. In other words, if at the t-th play action a has been chosen K_a times prior to t, yielding rewards r_1, r_2, ..., r_{K_a}, then its value is estimated to be

Q_t(a) = (r_1 + r_2 + ... + r_{K_a}) / K_a        (2.1)
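A minimal sketch of this sample-average estimate in code; the reward list in the example is made up:

def sample_average(rewards):
    """Estimate the value of an action as the mean of the rewards received when it was chosen."""
    if not rewards:
        return 0.0              # default value before the action has ever been tried
    return sum(rewards) / len(rewards)

# Example: the action was chosen 4 times prior to play t, yielding these rewards
print(sample_average([1, 0, 1, 1]))   # 0.75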
For example, if we have a problem with two actions – A and B, the epsilon
greedy algorithm works as shown below:
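A minimal sketch of that epsilon-greedy loop, where the true success probabilities of A and B and the value of epsilon are invented purely for illustration: with probability epsilon the agent explores by picking a random action; otherwise it exploits by picking the action with the highest estimated value.

import random

actions = ["A", "B"]
true_prob = {"A": 0.3, "B": 0.6}   # invented reward probabilities, unknown to the agent

Q = {a: 0.0 for a in actions}      # estimated action values
N = {a: 0 for a in actions}        # how many times each action was chosen
epsilon = 0.1

for t in range(1000):
    if random.random() < epsilon:          # explore
        a = random.choice(actions)
    else:                                  # exploit the current estimates
        a = max(Q, key=Q.get)

    reward = 1 if random.random() < true_prob[a] else 0

    # Incremental sample-average update of the chosen action's estimate
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]

print(Q)   # the estimates approach the true probabilities, and B ends up chosen most often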
The Markov property can be written as

P[S[t+1] | S[t]] = P[S[t+1] | S[1], S[2], ..., S[t]]

where S[t] denotes the current state of the agent and S[t+1] denotes the next state. What this equation means is that the transition from state S[t] to S[t+1] is entirely independent of the past. The RHS of the equation equals the LHS exactly when the system has the Markov property; intuitively, the current state already captures all the relevant information from the past states.
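As a small illustration of the Markov property (the states and transition probabilities below are invented), the next state is sampled from a distribution that depends only on the current state, so the history before the current time is never consulted:

import random

# Invented two-state transition probabilities: P[next state | current state]
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Sample S[t+1] using only S[t]; no earlier states are needed."""
    states, probs = zip(*P[current].items())
    return random.choices(states, weights=probs)[0]

s = "sunny"
for t in range(5):
    s = next_state(s)    # the trajectory before the current state is irrelevant
    print(t, s)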
Because the problem changes over time, the most useful data is going to come from the most recent rewards. Older rewards are going to be much less useful, because the changes to the action values accumulate over time.
Step four: substitute the expression for Q_n obtained in step three, Q_n = α·R_{n-1} + (1 − α)·Q_{n-1}, into the final equation from step two, Q_{n+1} = α·R_n + (1 − α)·Q_n:

Q_{n+1} = α·R_n + (1 − α)·[α·R_{n-1} + (1 − α)·Q_{n-1}]
        = α·R_n + α·(1 − α)·R_{n-1} + (1 − α)^2·Q_{n-1}

If we look at the end of the equation, we can see a Q_{n-1} term. Once again, we can get an expression for Q_{n-1} by subtracting 1 from n:

Q_{n-1} = α·R_{n-2} + (1 − α)·Q_{n-2}

Repeating this substitution all the way down to Q_1 gives the exponential recency-weighted average

Q_{n+1} = (1 − α)^n·Q_1 + Σ_{i=1}^{n} α·(1 − α)^{n-i}·R_i

in which more recent rewards receive exponentially larger weights.
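A small sketch tying these steps together; the step size and reward sequence are made up. It checks numerically that the incremental constant-step-size update and the fully expanded weighted sum give the same value, with the most recent rewards receiving the largest weights:

alpha = 0.1                     # constant step size (invented for the example)
Q1 = 0.0                        # initial estimate
rewards = [1, 0, 1, 1, 0, 1]    # invented reward sequence R_1 .. R_n

# Incremental form: Q_{n+1} = Q_n + alpha * (R_n - Q_n)
Q = Q1
for r in rewards:
    Q += alpha * (r - Q)

# Expanded form: Q_{n+1} = (1-alpha)^n * Q_1 + sum_i alpha * (1-alpha)^(n-i) * R_i
n = len(rewards)
Q_expanded = (1 - alpha) ** n * Q1 + sum(
    alpha * (1 - alpha) ** (n - i) * r for i, r in enumerate(rewards, start=1)
)

print(Q, Q_expanded)   # both print the same value (about 0.3129)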
As an example of a bandit problem with context, consider choosing which news article to display to a user visiting a website. The context is information about the user: where they come from, previously visited pages of the site, device information, geolocation, etc. An action is a choice of which news article to display. An outcome is whether the user clicked on the link or not. The reward is binary: 0 if there is no click, 1 if there is a click.
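A minimal sketch of this interaction loop; the contexts, articles, click probabilities, and the use of a separate epsilon-greedy estimate per (context, article) pair are all invented for illustration rather than taken from any particular production system:

import random
from collections import defaultdict

articles = ["sports", "politics", "tech"]      # actions: which article to display
contexts = ["mobile_user", "desktop_user"]     # simplified user context
# Invented click probabilities for each (context, article) pair
click_prob = {
    ("mobile_user", "sports"): 0.30, ("mobile_user", "politics"): 0.10, ("mobile_user", "tech"): 0.20,
    ("desktop_user", "sports"): 0.15, ("desktop_user", "politics"): 0.25, ("desktop_user", "tech"): 0.35,
}

Q = defaultdict(float)   # estimated click-through rate per (context, article)
N = defaultdict(int)     # how often each pair has been shown
epsilon = 0.1

for t in range(5000):
    ctx = random.choice(contexts)                          # a user arrives with some context
    if random.random() < epsilon:                          # explore
        a = random.choice(articles)
    else:                                                  # exploit the best estimate for this context
        a = max(articles, key=lambda art: Q[(ctx, art)])
    reward = 1 if random.random() < click_prob[(ctx, a)] else 0   # 1 = click, 0 = no click
    N[(ctx, a)] += 1
    Q[(ctx, a)] += (reward - Q[(ctx, a)]) / N[(ctx, a)]

print({k: round(v, 2) for k, v in Q.items()})   # estimates approach the invented click rates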
https://lcalem.github.io/blog/2018/09/22/sutton-chap02-bandits