
Reinforcement Learning

UNIT-I
Basics of probability and linear algebra, Definition of a stochastic
multi-armed bandit, Definition of regret, achieving sublinear regret,
UCB algorithm, KL-UCB, Thompson Sampling
Reinforcement Learning :

 It is a feedback-based learning method in which a learning agent gets a
reward for each right action and gets a punishment/penalty for each
wrong action.
 The agent learns automatically from this feedback and improves its
performance.
 In Reinforcement Learning the agent interacts with the environment and
explores it.
 The goal of the agent is to get the most reward points and hence
improve its performance. An example of RL is a self-driving car.

1. What is reinforcement learning? State one practical example.

 Reinforcement learning is a branch of machine learning that focuses on
how an agent can learn to make decisions or take actions in an
environment to maximize its cumulative reward. It is inspired by the
process of learning through trial and error.
 In reinforcement learning, an agent interacts with an environment and
receives feedback in the form of rewards or penalties based on its actions.
 The goal of the agent is to learn a policy, or a set of actions, that maximizes
the expected cumulative reward over time.
 The agent explores the environment, tries different actions, and receives
feedback, allowing it to learn which actions lead to higher rewards and
which ones yield lower rewards.
 Reinforcement learning involves the use of algorithms and mathematical
models to develop optimal strategies for decision-making.
 The agent learns through a process of trial and error and, based on this
experience, learns to perform the task in a better way. Hence, we can
say that "Reinforcement learning is
a type of machine learning method where an intelligent agent
(computer program) interacts with the environment and learns
to act within it." How a robotic dog learns the movement of its
arms is an example of reinforcement learning.

 It is a core part of Artificial Intelligence, and all AI agents work on the
concept of reinforcement learning. Here we do not need to pre-program
the agent, as it learns from its own experience without any human
intervention.
 The agent learns through a combination of exploration (trying out new
actions to gather information) and exploitation (using its current
knowledge to make decisions).

One key feature of reinforcement learning is the use of a reward signal, which
provides feedback to the agent based on the actions it takes. The agent's goal is
to learn to select actions that maximize long-term cumulative reward, rather
than optimizing for immediate rewards.

 Example:
Suppose there is an AI agent present within a maze
environment, and its goal is to find the diamond. The agent
interacts with the environment by performing some
actions, and based on those actions, the state of the agent
changes, and it also receives a reward or penalty as
feedback.
The agent continues doing these three things (take an action,
change state/remain in the same state, and get
feedback), and by doing so, it learns and explores
the environment.
Applications of Reinforcement learning:
It has applications in various domains including
 Robotics navigation
 Game playing
 Autonomous vehicles
 Finance
 Healthcare
 Marketing strategy control
 Webpage indexing, and more

 It has been successfully used to train agents that can play complex
games, control robotic systems, optimize resource allocation, and
make decisions in uncertain and dynamic environments.
(q) State key constituents of reinforcement learning. (Explain key terms in
reinforcement learning.)

The key terms of reinforcement learning are:

Agent: An entity that can explore the environment and act upon it.
Environment: The situation in which the agent is present or by which it is
surrounded. In RL, we assume a stochastic environment, which means it is
random in nature.
Action: Actions are the moves taken by the agent within the environment.
State: The state is the situation returned by the environment after each action
taken by the agent.
Reward: Feedback returned to the agent from the environment to
evaluate the agent's action.

State key features of reinforcement learning.

In RL, the agent is not instructed about the environment or which
actions need to be taken.
It is based on a trial-and-error process.
The agent takes the next action and changes state according to the
feedback from the previous action.
The agent may get a delayed reward.
The environment is stochastic, and the agent needs to explore it to
obtain the maximum positive reward.

(Q) Explain elements of reinforcement learning

Apart from the environment in which the agent acts, a reinforcement
learning system has four main sub-elements:
1. Policy
2. Reward signal
3. Value function, and
4. A model of the environment (OPTIONAL)

1. Policy: (Rule)
The policy is the core of a reinforcement learning agent; it alone is
sufficient to determine behavior.
 The policy is the agent's behavior function.
 It defines the behavior (action) the agent takes in a given situation.
 A policy is a function that maps the agent's current state to an action.
 In general, policies may be stochastic, specifying probabilities
for each action, or deterministic.
For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[At = a | St = s]
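
As a minimal, hypothetical sketch of the two cases (the states, actions, and probabilities below are invented for illustration), a deterministic policy can be a plain lookup table, while a stochastic policy samples an action from a probability distribution:

```python
import random

# Hypothetical states and actions, purely for illustration.
deterministic_policy = {"s0": "left", "s1": "right"}   # a = pi(s)

# pi(a | s) = P[At = a | St = s]: a distribution over actions for each state.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.3, "right": 0.7},
}

def act(state):
    # Sample an action according to the stochastic policy's distribution.
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(deterministic_policy["s0"], act("s0"))
```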

2. Reward signal:
 A reward signal defines the goal of a reinforcement learning
problem.
 A reward is a numerical value sent by the environment to
the reinforcement learning agent every time it performs an
action (feedback).
 The goal of the RL agent is to maximize the total reward it receives
over the long run.
 The reward signal is the primary basis for altering the policy; if
an action selected by the policy is followed by a low reward, then
the policy may be changed to select some other action in that
situation in the future, i.e. the policy changes its behavior based
on the reward signal.
 In general, reward signals may be stochastic functions of the
state of the environment and the actions taken.

3. The value function:

 The value function is a prediction of future reward.
 A value function specifies what is good in each state
and/or action in the long run.
 It is used to evaluate the goodness/badness of a state.

 The value of a state is the total amount of reward an agent can
expect to accumulate over the future, starting from that state.
 The value function depends on the reward: without reward,
there could be no value. The goal of estimating values is to
achieve more reward.
 In fact, the most important component of almost all
reinforcement learning algorithms we consider is a method for
efficiently estimating values.
 To select between actions, the value is estimated as
Vπ(s) = Eπ[ Rt + γRt+1 + γ²Rt+2 + γ³Rt+3 + ... | St = s ]
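
A small numerical sketch of the quantity inside this expectation (the rewards and discount factor below are made up for illustration): the discounted return from a state is the reward sum weighted by increasing powers of γ, and the value function is its expectation over trajectories starting in that state.

```python
# Discounted return G = Rt + gamma*Rt+1 + gamma^2*Rt+2 + ... for one
# illustrative reward sequence; Vpi(s) is the expectation of this quantity
# over trajectories that start in state s.
rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical rewards observed from a state onward
gamma = 0.9                      # assumed discount factor

G = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G)   # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```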
4. Model of the environment:
 The fourth and final element of some reinforcement learning
systems is a model of the environment.
 A model is the agent's representation of the environment.
 The model of the environment is something that defines how
the environment will behave when an action is performed in a
given state.
 For a given state and action, the model can predict the next
state and reward.
 Methods for solving reinforcement learning problems that use
models and planning are called model-based methods.
 Model-free methods are explicitly trial-and-error learners and can
be viewed as almost the opposite of planning.
 We also explore reinforcement learning systems that simultaneously
learn by trial and error and plan with a learned model.
 Modern reinforcement learning spans the range from
low-level, trial-and-error learning to high-level, deliberative
planning.
TWO fundamental problems in sequential decision making
1. Reinforcement Learning: (MODEL-FREE METHOD)
 The environment is initially unknown
 The agent interacts with the environment
 The agent improves its policy
2. Planning:
o A model of the environment is known
o The agent performs computations with its model (without
any external interaction), i.e. given a state and an action,
the model gives the reward and the next state
o The agent improves its policy
(q) Explain approaches to implement reinforcement learning.
OR
Explain value-based, policy-based, and model-based
reinforcement learning.

There are mainly three ways to implement reinforcement learning in ML,
which are:

Value-based: The value-based approach is about finding the optimal value
function, which is the maximum value of a state under any policy.
In other words, the agent expects the long-term return at any state s under
policy π.

Policy-based: The policy-based approach is to find the optimal policy for the
maximum future reward without using the value function. In this
approach, the agent tries to apply a policy such that the action performed
at each step helps to maximize the future reward. The policy-based
approach has mainly two types of policy:
 Deterministic: The same action is produced by the policy (π) at any given state.
 Stochastic: In this policy, probability determines the produced action.

Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no
single solution or algorithm for this approach because the model representation
is different for each environment.
Model-based:
 Policy and/or value function
 Model
Value-based:
 No policy
 Value function
Policy-based:
 Policy
 No value function
Actor-Critic:
 Policy
 Value function
(Q) Difference between Reinforcement Learning and Supervised
Learning

 Reinforcement Learning and Supervised Learning
are both parts of machine learning, but the two types of
learning are quite different from each other.
 RL agents interact with the environment, explore it,
take actions, and get rewarded.
 Whereas in supervised learning, algorithms learn from a
labeled dataset and, on the basis of the training, predict
the output.
 The difference table between RL and Supervised Learning
is given below.

Reinforcement Learning | Supervised Learning
RL works by interacting with the environment. | Supervised learning works on an existing labeled dataset.
The RL algorithm works the way the human brain works when making decisions. | Supervised learning works the way a human learns things under the supervision of a guide.
No labeled dataset is present. | A labeled dataset is present.
No previous training is provided to the learning agent. | Training is provided to the algorithm so that it can predict the output.
RL helps to take decisions sequentially. | In supervised learning, a decision is made when the input is given.
1.6 Summary

Reinforcement learning is a computational approach to


understanding and automating goal-directed learning and decision
making. It is distinguished from other computational approaches by
its emphasis on learning by an agent from direct interaction with its
environment, without requiring exemplary supervision or complete
models of the environment.

In our opinion, reinforcement learning is the first field to seriously


address the computational issues that arise when learning from
interaction with an environment in order to achieve long-term goals.

Reinforcement learning uses the formal framework of Markov


decision processes to define the interaction between a learning agent
and its environment in terms of states, actions, and rewards. This
framework is intended to be a simple way of representing essential
features of the artificial intelligence problem. These features include
a sense of cause and effect, a sense of uncertainty and
nondeterminism, and the existence of explicit goals.

The concepts of value and value function are key to most


of the reinforcement learning methods that we consider in this book.
We take the position that value functions are important for efficient
search in the space of policies. The use of value functions
distinguishes reinforcement learning methods from evolutionary
methods that search directly in policy space guided by evaluations of
entire policies.
Basics of probability in reinforcement learning:
Probability is an integral part of reinforcement learning, as it helps agents
make decisions, learn from interactions with the environment, and estimate
the expected outcomes of their actions. RL algorithms use probability to
optimize policies that maximize cumulative rewards over time.

Here are some basics of probability in the context of reinforcement


learning:

1. Markov Decision Processes (MDPs): Reinforcement learning problems


are often formulated as Markov Decision Processes, which consist of
states, actions, transition probabilities, rewards, and a discount
factor. Transition probabilities represent the likelihood of
transitioning from one state to another after taking a specific action.
2. State Transitions: When an agent takes an action in a specific state,
the environment transitions to a new state based on the transition
probabilities. These probabilities determine the likelihood of moving
to different states. In probabilistic terms, they define a probability
distribution over the next states.
3. Policy: A policy in reinforcement learning represents the agent's
behavior, i.e., the strategy it uses to select actions in different states.
Policies can be deterministic (e.g., always selecting the same action
in a given state) or stochastic (selecting actions based on a probability
distribution).
4. Action Selection: In stochastic policies, the agent selects actions
based on probability distributions. These distributions can be
explicitly defined or implicitly represented by value functions.
Common methods for action selection include softmax, epsilon-
greedy, and Thompson sampling, which use probabilities to balance
exploration and exploitation (a short softmax sketch follows this section).

5. Transition and Reward Distributions: The transition probabilities


define the likelihood of transitioning to different states, while the
reward distribution represents the probabilities of receiving different
rewards in different states. These distributions are often unknown to
the agent and need to be estimated through interactions with the
environment.

6. Probability Distributions and Sampling: In reinforcement learning,


agents often rely on sampling to estimate probabilities and make
decisions. They sample actions from probability distributions,
transition to new states based on transition probabilities, and
observe rewards from reward distributions. Statistical techniques
like Monte Carlo methods and temporal difference learning utilize
these samples to estimate value functions and improve decision-
making.
7. Exploration and Exploitation: Probability is fundamental to
balancing exploration (trying out different actions to learn more
about the environment) and exploitation (taking actions with high
expected returns). The agent uses probability distributions to make
decisions that account for both exploration and exploitation goals.

These are some of the basic concepts where probability comes into
play in reinforcement learning. Understanding and effectively using
probabilities is essential for agents to learn optimal policies and make
informed decisions in uncertain environments.
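
To make item 4 above concrete, here is a minimal, hedged sketch of softmax (Boltzmann) action selection; the action-value estimates and temperature are made-up illustrative numbers, not values from the text:

```python
import math
import random

# Softmax (Boltzmann) action selection over estimated action values.
Q = [0.1, 0.5, 0.3]   # hypothetical action-value estimates
tau = 0.5             # temperature: higher tau -> more exploration

prefs = [math.exp(q / tau) for q in Q]
probs = [p / sum(prefs) for p in prefs]   # probability distribution over actions

action = random.choices(range(len(Q)), weights=probs, k=1)[0]
print(probs, action)
```

Lowering the temperature tau concentrates probability on the highest-valued action (more exploitation), while raising it spreads probability across actions (more exploration).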

Basics of linear algebra :


Linear algebra is an important mathematical framework used in
various aspects of reinforcement learning. Here are some basics of
linear algebra relevant to reinforcement learning:

1. Vectors: Vectors are fundamental in representing and manipulating


quantities in reinforcement learning. In the context of state and
action spaces, vectors are used to represent the states and actions.
For example, a state vector can contain features or observations that
describe the current state of the environment.
2. Matrices: Matrices are rectangular arrays of numbers. They are
commonly used to represent transformations, such as state
transitions and policy representations. In reinforcement learning,
transition matrices can describe the probabilities of transitioning
between states given different actions.
3. Vector Operations: Several operations are performed on vectors in
reinforcement learning. These include addition, subtraction, scalar
multiplication, dot product, and element-wise operations. For
example, the dot product between two vectors can be used to
measure similarity or compute the value of a state-action pair.
4. Matrix Operations: Matrix operations are extensively used in
reinforcement learning algorithms. Some important operations
include matrix multiplication, transpose, inverse, and element-wise
operations. For example, matrix multiplication can be used to
compute the value function update in iterative algorithms like value
iteration or policy evaluation.
5. Eigenvalues and Eigenvectors: Eigenvalues and eigenvectors play a
significant role in many reinforcement learning algorithms. They are
used to analyze and characterize the behavior of linear
transformations, such as state transitions or policy updates.
Eigenvectors associated with the dominant eigenvalues can represent
important directions in the state or action space.
6. Matrix Factorization: Matrix factorization techniques, such as
Singular Value Decomposition (SVD) or Eigenvalue Decomposition,
are utilized in reinforcement learning for dimensionality reduction,
feature extraction, and policy approximation. These techniques
decompose a matrix into constituent parts that capture its essential
properties.
7. Linear Systems and Equations: Linear algebra provides tools to solve
linear systems and equations, which are prevalent in reinforcement
learning. For example, in dynamic programming, the Bellman equations
for a fixed policy can be written as a system of linear equations and
solved to find that policy's value function (see the sketch after this list).
8. Least Squares Estimation: Linear regression and least squares
estimation techniques are commonly used in reinforcement learning
to approximate value functions, policy parameters, or other function
approximations. These techniques utilize linear algebra concepts to
find the best-fit solutions to minimize the error between predictions
and observed values.
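
As a minimal sketch of point 7 above (the transition matrix, rewards, and discount factor are invented purely for illustration), policy evaluation can be done by solving the linear system v = r + γPv directly:

```python
import numpy as np

# Policy evaluation by solving the Bellman expectation equation directly.
# P[s, s'] and r[s] describe an illustrative 3-state problem under a fixed policy.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])   # state-transition probabilities under the policy
r = np.array([1.0, 0.0, 2.0])     # expected immediate reward per state
gamma = 0.9                       # discount factor

# v = r + gamma * P v  <=>  (I - gamma * P) v = r
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)
```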

Understanding linear algebra enables reinforcement learning


practitioners to effectively model and manipulate state and action
spaces, analyze algorithms, and develop efficient solutions. It forms
the basis for many advanced concepts and techniques in the field.

Definition of a stochastic multi-armed bandit


 A stochastic multi-armed bandit (MAB) is a classic problem in the field of
reinforcement learning and decision theory.
 It represents a scenario where an agent (often referred to as a bandit) is
faced with a set of "arms" or options, each with an unknown reward
distribution.
 The agent's objective is to maximize its cumulative reward over a series
of actions.
 In a stochastic MAB, the reward distributions associated with each arm
are assumed to be stochastic, meaning that they follow some underlying
probability distribution.
 The rewards obtained from each arm are random variables, and their
distribution may vary across different arms.
 At each time step, the agent selects an arm to pull, and it receives a reward
based on the distribution associated with that arm.
 The agent's goal is to learn the arm with the highest expected reward by
sequentially exploring different arms and exploiting the knowledge
gained so far.
 The challenge in the stochastic multi-armed bandit problem lies in the
exploration-exploitation trade-off.
 On one hand, the agent needs to explore different arms to gather
information about their reward distributions. On the other hand, it should
exploit the knowledge it has gained so far to maximize its expected
reward by favoring the arms with higher estimated rewards.
Algorithms: Several algorithms have been developed to address the stochastic
multi-armed bandit problem, such as UCB (Upper Confidence Bound),
Thompson Sampling, and EXP3 (Exponential-weight algorithm for
Exploration and Exploitation).

These algorithms employ various strategies to balance exploration and


exploitation based on probabilistic reasoning, enabling the agent to learn and
adapt its decision-making policy over time.
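
As a minimal, hypothetical sketch of such a stochastic bandit (the arm probabilities below are invented for illustration and would be unknown to the agent in a real problem):

```python
import random

class BernoulliBandit:
    """A stochastic multi-armed bandit whose arms pay 1 with an unknown probability."""
    def __init__(self, probs):
        self.probs = probs   # true success probability of each arm (hidden from the agent)

    def pull(self, arm):
        # Reward is drawn from the chosen arm's Bernoulli distribution.
        return 1 if random.random() < self.probs[arm] else 0

bandit = BernoulliBandit([0.2, 0.5, 0.7])   # illustrative probabilities; arm 2 is best
print(bandit.pull(0), bandit.pull(2))
```

The algorithms discussed later (UCB, KL-UCB, Thompson Sampling) all interact with an environment of this form, learning only from the rewards returned by pull().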

Applications of Stochastic multi-armed bandit:

 Stochastic multi-armed bandit problems have practical applications in


various domains, such as
 online advertising,
 clinical trials,
 recommendation systems
 and resource allocation,
 where decision-makers need to make sequential choices under
uncertainty to optimize long-term rewards.

Best example of exploration & exploitation

Exploration: Exploration is the process of going through the entire subject to learn
all the concepts (objectives) of the subject.

Exploitation: Exploitation is the way of presenting answers to the questions in the
examination to get the highest marks using the knowledge gained during the
exploration process.
k-multi-armed bandit:
 Consider the following learning problem. You are faced
repeatedly with a choice among k-different options, or actions.
After each choice you receive a numerical reward chosen from
a stationary probability distribution that depends on the action
you selected.
 Your objective is to maximize the expected total reward over
some time period, for example, over 1000 action selections, or
time steps.
 This is the original form of the k-armed bandit problem, so
named by analogy to a slot machine, or “one-armed bandit,”
except that it has k levers instead of one.
 Each action selection is like a play of one of the slot machine’s
levers, and the rewards are the payoffs for hitting the jackpot.
Through repeated action selections you are to maximize your
winnings by concentrating your actions on the best levers

 Another analogy is that of a doctor choosing between


experimental treatments for a series of seriously ill patients.
Each action is the selection of a treatment, and each reward is
the survival or well-being of the patient.

 In the k-armed bandit problem, each of the k actions has an
expected or mean reward given that the action is selected;
 let us call this the value of that action. We denote the action
selected on time step t as At, and the corresponding reward as
Rt.
 The value of an arbitrary action a, denoted q*(a), is the
expected reward given that a is selected:
q*(a) = E[Rt | At = a].

If you knew the value of each action, then it would be trivial to solve
the k-armed bandit problem: you would always select the action with
the highest value.
 We denote the estimated value of action a at time step t as
Qt(a). We would like Qt(a) to be close to q*(a).

 If you maintain estimates of the action values, then at any time
step there is at least one action whose estimated value is
greatest. We call these the greedy actions.
 When you select one of these actions, we say that you are
exploiting your current knowledge of the values of the actions.

Action-value Methods:
 The methods for estimating the values of actions and for using
the estimates to make action-selection decisions are
collectively called action-value methods.
 One natural way to estimate the value of an action is by averaging
the rewards actually received:

Qt(a) = (sum of rewards when a was taken prior to t) / (number of times a was taken prior to t)
      = Σi Ri · 1(Ai = a) / Σi 1(Ai = a),   with the sums running over i = 1, ..., t-1

 where 1(predicate) denotes the random variable that is 1 if the
predicate is true and 0 if it is not.
 The simplest action selection rule is to select one of the actions
with the highest estimated value, that is, one of the greedy
actions.
 If there is more than one greedy action, then a selection is made
among them in some arbitrary way, perhaps randomly. We write
this greedy action selection method as

At = argmaxa Qt(a)

where argmaxa denotes the action a for which the expression that
follows is maximized.
Greedy action selection always exploits current knowledge to
maximize immediate reward (a short sketch of this method follows below).
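
A minimal sketch of the sample-average action-value method with greedy selection, using the incremental form Qt(a) ← Qt(a) + (1/N(a))(R − Qt(a)); the three-arm Bernoulli bandit and its probabilities are assumed purely for illustration:

```python
import random

true_probs = [0.2, 0.5, 0.7]   # assumed arm success probabilities (unknown to the agent)
K = len(true_probs)
Q = [0.0] * K                  # Qt(a): estimated value of each action
N = [0] * K                    # N(a): number of times each action has been selected

for t in range(1000):
    # Greedy selection, breaking ties randomly among the maximizing actions.
    best = max(Q)
    a = random.choice([i for i in range(K) if Q[i] == best])
    reward = 1 if random.random() < true_probs[a] else 0
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]   # incremental sample-average update

print(Q, N)
```

Because pure greedy selection never deliberately explores, it can lock onto a suboptimal arm; the methods below (epsilon-greedy, UCB, KL-UCB, Thompson Sampling) address exactly this.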
Definition of regret:
In the context of decision-making and reinforcement learning, regret
is a concept that measures the performance of an agent or algorithm.

 Regret is defined as the difference between the maximum expected reward
that could have been obtained by always choosing the best possible action
and the total expected reward actually obtained by the agent.

 A lower regret indicates better performance, as it suggests that


the agent is making decisions closer to the optimal ones.

In multi-armed bandit problems, regret is commonly used to evaluate


the performance of algorithms. The regret (denoted as R(T)) after T
time steps is calculated as follows:

R(T) = (max_total_reward) - (total_reward_with_algorithm),

where:

 max_total_reward is the (expected) cumulative reward
obtained by always selecting the optimal action at each time step.

 total_reward_with_algorithm is the cumulative
reward achieved by the bandit algorithm over T time steps.
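
A minimal numerical sketch of this definition (the arm means, horizon, and chosen-arm sequence are invented for illustration): when the arm means are known, the expected regret is the sum of per-step gaps between the best mean and the mean of the arm actually chosen.

```python
# Empirical (pseudo-)regret for a known Bernoulli bandit and a given arm sequence.
true_means = [0.2, 0.5, 0.7]   # assumed arm means
best_mean = max(true_means)
T = 1000

chosen_arms = [1] * T          # illustrative run: the algorithm always picked arm 1

regret = sum(best_mean - true_means[a] for a in chosen_arms)
print(regret)   # 1000 * (0.7 - 0.5) = 200 for this made-up sequence
```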

In the context of the multi-armed bandit problem,


various algorithms and strategies are designed to minimize regret
and efficiently explore the action space to find the best actions.

 The goal of a bandit algorithm is to strike a balance between


exploration (trying out different actions to gather information)
and exploitation (selecting the best-known action) to achieve a
low regret and maximize cumulative rewards over time.
 Different bandit algorithms, such as epsilon-greedy, UCB
(Upper Confidence Bound), and Thompson Sampling, have been
developed with various regret bounds and performance
guarantees.
Achieving sublinear regret:
Achieving sublinear regret is a desirable property in reinforcement
learning and decision-making problems. Sublinear regret means
that as the number of actions or time steps increases, the regret
grows at a rate slower than linear, indicating that the agent's
performance improves over time.

To achieve sublinear regret, several techniques and algorithms are


commonly employed:

1. Exploration-Exploitation Trade-off: Balancing exploration and
exploitation is crucial in achieving sublinear regret. Initially, the
agent needs to explore different options to learn about the reward
distributions of different actions. As it gains knowledge, it should
gradually shift towards exploiting the best actions to maximize
cumulative rewards.
2. Upper Confidence Bound (UCB): UCB algorithms, such as UCB1 and
UCB2, incorporate uncertainty in the estimation of action values.
They use confidence intervals to balance exploration and
exploitation. By assigning higher confidence bounds to less explored
actions, UCB algorithms encourage exploration in the early stages
while converging to the optimal action in the long run.
3. Thompson Sampling: Thompson Sampling is a probabilistic
algorithm that balances exploration and exploitation by maintaining
a posterior distribution over the reward distributions of actions. It
samples a reward distribution from the posterior and selects the
action with the highest expected reward based on the samples.
Thompson Sampling has been shown to achieve sublinear regret in
certain settings.
4. Model-Based Approaches: Model-based algorithms leverage a learned
or assumed model of the environment to optimize actions and achieve
sublinear regret. By using the model to plan and simulate future
actions and rewards, these algorithms can make more informed
decisions and improve performance over time.
UCB Algorithm:
The UCB (Upper Confidence Bound) algorithm is a popular algorithm
used in the context of multi-armed bandit problems to balance
exploration and exploitation.

 It balances between exploring different actions and exploiting


the currently estimated best action based on confidence
bounds.
 It aims to achieve sublinear regret by selecting actions that
maximize the upper confidence bound on their estimated
values.

Here's an overview of how the UCB algorithm works:

1. Initialization: Initialize the estimates of action values, denoted as
Qt(a), for each action "a" to some initial value (e.g., 0),
and initially play each action (arm) once.
2. Action Selection: At each time step, select the action that maximizes
the upper confidence bound of its estimated value.
The UCB formula typically used is:
UCB(a) = Qt(a) + c * sqrt(log(t) / Nt(a))
 Qt(a): The estimated value of action "a."
 c: A parameter that controls the exploration-exploitation trade-
off. Higher values of c encourage more exploration.
 t: The total number of time steps elapsed.
 N(a): The number of times action "a" has been selected so far.
The term sqrt(log(t) / Nt(a)) represents the exploration term that
decreases as the action is chosen more often, promoting exploitation
of actions with higher estimated values.
3. Action Execution and Reward: Execute the selected action and
observe the reward associated with that action.
4. Update Action Value Estimates: Update the estimated value of the
selected action based on the observed reward:
Qt(a) <- Qt(a) + (1/N(a)) * (reward - Qt(a))
 N(a): Increment the count of times action "a" has been selected.
5. Repeat Steps 2-4: Continue selecting actions, updating action value
estimates, and observing rewards until a specified number of
iterations or time steps.

the number c > 0 controls the degree of exploration. If Nt(a) = 0, then a


is considered to be a maximizing action.
Each time a is selected the uncertainty is presumably reduced:
Nt(a) increments, and, as it appears in the denominator, the
uncertainty term decreases

 It is important to tune the exploration parameter "c"


appropriately for the problem at hand.
 Higher values of "c" encourage more exploration, while lower
values favor exploitation.
 The choice of "c" depends on the problem's characteristics and
the desired trade-off between exploration and exploitation.
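
A minimal UCB1 sketch following the steps above (the arm probabilities, horizon, and value of c are assumed for illustration):

```python
import math
import random

true_probs = [0.2, 0.5, 0.7]   # assumed arm success probabilities (unknown to the agent)
K = len(true_probs)
c = 2.0                        # exploration parameter
Q = [0.0] * K                  # estimated value of each arm
N = [0] * K                    # pull counts

def pull(a):
    return 1 if random.random() < true_probs[a] else 0

# Step 1: play each arm once.
for a in range(K):
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

# Steps 2-5: pick the arm with the largest upper confidence bound, then update.
for t in range(K + 1, 1001):
    ucb = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(K)]
    a = ucb.index(max(ucb))
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print(Q, N)   # the best arm should accumulate most of the pulls
```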
KL-UCB:
KL-UCB (Kullback-Leibler Upper Confidence Bound) is an algorithm
used in the context of multi-armed bandit problems to balance
exploration and exploitation based on the Kullback-Leibler
divergence between the estimated and true reward distributions.

Here's an overview of how the KL-UCB algorithm works:

1. Initialization: Initialize the estimates of action values, denoted as


Q(a), for each action "a" to some initial value (e.g., 0).
2. Action Selection: At each time step, select the action that maximizes
the upper confidence bound on its estimated value. For rewards in [0, 1]
(e.g., Bernoulli rewards), the KL-UCB index is:
UCB(a) = max { q in [Q(a), 1] : N(a) * d(Q(a), q) <= log(t) }
where d(p, q) = p*log(p/q) + (1-p)*log((1-p)/(1-q)) is the Bernoulli
Kullback-Leibler divergence.
 Q(a): The estimated (mean) value of action "a."
 t: The total number of time steps elapsed.
 N(a): The number of times action "a" has been selected so far.
The index is the largest mean reward that is still statistically consistent
with the rewards observed for action "a"; it is large for actions with high
uncertainty (low count N(a)) and tightens toward Q(a) as the action is
selected more often.
3. Action Execution and Reward: Execute the selected action and
observe the reward associated with that action.
4. Update Action Value Estimates: Update the estimated value of the
selected action based on the observed reward:
Q(a) <- Q(a) + (1/N(a)) * (reward - Q(a))
 N(a): Increment the count of times action "a" has been selected.
5. Repeat Steps 2-4: Continue selecting actions, updating action value
estimates, and observing rewards until a specified number of
iterations or time steps.
The KL-UCB algorithm uses the Kullback-Leibler divergence to
estimate the confidence bounds on the true reward distributions. By
choosing actions that maximize the upper confidence bound, the
algorithm balances exploration and exploitation, favoring actions
that are likely to have higher true rewards or actions with higher
uncertainty.

The confidence bound tightens as the action is chosen more often and
more rewards are observed, encouraging the algorithm to shift towards
exploitation as more information is gathered.

The KL-UCB algorithm has been shown to achieve sublinear regret


bounds under certain assumptions about the reward distributions.
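
A minimal KL-UCB sketch for Bernoulli rewards, with the index computed by bisection (the arm probabilities and horizon are assumed for illustration; a fuller implementation would also handle the log log t refinement and non-Bernoulli rewards):

```python
import math
import random

true_probs = [0.2, 0.5, 0.7]   # assumed arm success probabilities (unknown to the agent)
K = len(true_probs)
Q = [0.0] * K                  # empirical mean reward of each arm
N = [0] * K                    # pull counts

def kl(p, q, eps=1e-12):
    # Bernoulli Kullback-Leibler divergence d(p, q).
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n, t, precision=1e-6):
    # Largest q in [mean, 1] with n * d(mean, q) <= log(t), found by bisection.
    bound = math.log(max(t, 2)) / n
    lo, hi = mean, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if kl(mean, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def pull(a):
    return 1 if random.random() < true_probs[a] else 0

for a in range(K):   # play each arm once
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

for t in range(K + 1, 1001):
    indices = [kl_ucb_index(Q[a], N[a], t) for a in range(K)]
    a = indices.index(max(indices))
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print(Q, N)
```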
Thompson Sampling:
 Thompson Sampling is a popular algorithm used in the context of
multi-armed bandit problems.

 Thompson Sampling balances exploration and exploitation by
maintaining a Beta posterior over each action's mean reward and
sampling from it to decide which action to take.

Thompson Sampling algorithm

1. Initialization: Set prior distributions for the unknown reward
distributions associated with each action.
For each arm i = 1, 2, ..., K set Si = 0 (successes) and Fi = 0 (failures).
2. Action Selection: Draw a sample from each action's corresponding
posterior distribution:
Draw a sample Qi(t) from Beta(Si + 1, Fi + 1) for each arm i.
3. Action Execution and Reward: Execute the selected action and observe
the reward associated with it:
At = argmaxi Qi(t)
Observe the reward rt.
4. Update Posterior Distributions: Update the posterior distribution of the
selected action based on the observed reward. This involves updating the
parameters of its Beta distribution:
If rt = 1 then SAt = SAt + 1
else FAt = FAt + 1.
5. Repeat Steps 2-4: Continue sampling from the posteriors, executing
actions, updating the success and failure counts, and observing
rewards until a specified number of iterations or time steps.
6. Action Selection for Exploitation: As the posterior distributions are
updated with more observations, the samples concentrate around each
arm's true mean reward, and the algorithm naturally shifts from
exploration to selecting the action with the highest expected reward.
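
A minimal Beta-Bernoulli Thompson Sampling sketch following the steps above (the arm probabilities and horizon are assumed for illustration):

```python
import random

true_probs = [0.2, 0.5, 0.7]   # assumed arm success probabilities (unknown to the agent)
K = len(true_probs)
S = [0] * K                    # Si: success count of arm i
F = [0] * K                    # Fi: failure count of arm i

for t in range(1000):
    # Step 2: draw one sample Qi(t) from each arm's Beta(Si + 1, Fi + 1) posterior.
    samples = [random.betavariate(S[i] + 1, F[i] + 1) for i in range(K)]
    a = samples.index(max(samples))                  # Step 3: At = argmax_i Qi(t)
    rt = 1 if random.random() < true_probs[a] else 0
    if rt == 1:                                      # Step 4: update the Beta parameters
        S[a] += 1
    else:
        F[a] += 1

print(S, F)   # the best arm should end up with the most pulls and successes
```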
