
Reinforcement Learning

UNIT-I
Basics of probability and linear algebra, Definition of a stochastic
multi-armed bandit, Definition of regret, achieving sublinear regret,
UCB algorithm, KL-UCB, Thompson Sampling
Reinforcement Learning :

 It is a feedback-based learning method in which a learning agent gets a
reward for each right action and gets a punishment/penalty for each
wrong action.
 The agent learns automatically from this feedback and improves its
performance.
 In Reinforcement Learning the agent interacts with the environment and
explores it.
 The goal of the agent is to get the most reward points and hence
improve its performance. An example of RL is a self-driving car.

1. What is reinforcement learning? State one practical example.

 Reinforcement learning is a branch of machine learning that focuses on
how an agent can learn to make decisions or take actions in an
environment to maximize its cumulative reward. It is inspired by the
process of learning through trial and error.
 In reinforcement learning, an agent interacts with an environment and
receives feedback in the form of rewards or penalties based on its actions.
 The goal of the agent is to learn a policy, or a set of actions, that maximizes
the expected cumulative reward over time.
 The agent explores the environment, tries different actions, and receives
feedback, allowing it to learn which actions lead to higher rewards and
which ones yield lower rewards.
 Reinforcement learning involves the use of algorithms and mathematical
models to develop optimal strategies for decision-making.
 The agent learns through a process of trial and error and, based on this
experience, learns to perform the task in a better way. Hence, we can
say that "Reinforcement learning is
a type of machine learning method where an intelligent agent
(computer program) interacts with the environment and learns
to act within it." How a robotic dog learns the movement of its
arms is an example of reinforcement learning.

 It is a core part of Artificial Intelligence, and all AI agents work on the
concept of reinforcement learning. Here we do not need to pre-program
the agent, as it learns from its own experience without any human
intervention.
 The agent learns through a combination of exploration (trying out new
actions to gather information) and exploitation (using its current
knowledge to make decisions).

One key feature of reinforcement learning is the use of a reward signal, which
provides feedback to the agent based on the actions it takes. The agent's goal is
to learn to select actions that maximize long-term cumulative reward, rather
than optimizing for immediate rewards.

 Example:
Suppose there is an AI agent present within a maze
environment, and its goal is to find the diamond. The agent
interacts with the environment by performing some
actions, and based on those actions, the state of the agent
changes, and it also receives a reward or penalty as
feedback.
The agent continues doing these three things (take an action,
change state/remain in the same state, and get
feedback), and by doing so, it learns and explores
the environment.
Applications of Reinforcement learning:
It has applications in various domains including
 Robotics navigation
 Game playing
 Autonomous vehicles
 Finance
 Healthcare
 Marketing strategy control
 Webpage indexing, and more

 It has been successfully used to train agents that can play complex
games, control robotic systems, optimize resource allocation, and
make decisions in uncertain and dynamic environments.
(q) State key constituents of reinforcement learning. (Explain key terms in
reinforcement learning.)

The key terms of reinforcement learning are:

Agent: An entity that can explore the environment and act upon it.
Environment: The situation in which the agent is present or by which it is
surrounded. In RL, we assume a stochastic environment, which means it is
random in nature.
Action: Actions are the moves taken by the agent within the environment.
State: The state is the situation returned by the environment after each action
taken by the agent.
Reward: Feedback returned to the agent from the environment to
evaluate the agent's action.

State key features of reinforcement learning.

In RL, the agent is not instructed about the environment or which
actions need to be taken.
It is based on a trial-and-error process.
The agent takes the next action and changes state according to the
feedback from the previous action.
The agent may get a delayed reward.
The environment is stochastic, and the agent needs to explore it to
obtain the maximum positive reward.

(Q) Explain elements of reinforcement learning

Apart from the environment in which the agent acts, a reinforcement
learning system has four main sub-elements:
1. Policy
2. Reward signal
3. Value function, and
4. A model of the environment (OPTIONAL)

1. Policy: (Rule)
The policy is the core of a reinforcement learning agent; it alone is
sufficient to determine behavior.
 The policy is the agent's behavior function.
 It defines the behavior (action) the agent takes in a given situation.
 A policy is a function that maps the agent's current state to an action.
 In general, policies may be stochastic, specifying probabilities
for each action, or deterministic.
For a deterministic policy: a = π(s)
For a stochastic policy: π(a | s) = P[At = a | St = s]
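
As a minimal, hypothetical sketch of the two cases (the states, actions, and probabilities below are invented for illustration), a deterministic policy can be a plain lookup table, while a stochastic policy samples an action from a probability distribution:

```python
import random

# Hypothetical states and actions, purely for illustration.
deterministic_policy = {"s0": "left", "s1": "right"}   # a = pi(s)

# pi(a | s) = P[At = a | St = s]: a distribution over actions for each state.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.3, "right": 0.7},
}

def act(state):
    # Sample an action according to the stochastic policy's distribution.
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(deterministic_policy["s0"], act("s0"))
```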

2. Reward signal:
 A reward signal defines the goal of a reinforcement learning
problem.
 A reward is a numerical value sent by the environment to
the reinforcement learning agent every time it performs an
action (feedback).
 The goal of the RL agent is to maximize the total reward it receives
over the long run.
 The reward signal is the primary basis for altering the policy; if
an action selected by the policy is followed by a low reward, then
the policy may be changed to select some other action in that
situation in the future, i.e. the policy changes its behavior based
on the reward signal.
 In general, reward signals may be stochastic functions of the
state of the environment and the actions taken.

3. The value function:

 The value function is a prediction of future reward.
 A value function specifies what is good in each state
and/or action in the long run.
 It is used to evaluate the goodness/badness of a state.

 The value of a state is the total amount of reward an agent can
expect to accumulate over the future, starting from that state.
 The value function depends on the reward: without reward,
there could be no value. The goal of estimating values is to
achieve more reward.
 In fact, the most important component of almost all
reinforcement learning algorithms we consider is a method for
efficiently estimating values.
 To select between actions, the value is estimated as
Vπ(s) = Eπ[ Rt + γRt+1 + γ²Rt+2 + γ³Rt+3 + ... | St = s ]
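
A small numerical sketch of the quantity inside this expectation (the rewards and discount factor below are made up for illustration): the discounted return from a state is the reward sum weighted by increasing powers of γ, and the value function is its expectation over trajectories starting in that state.

```python
# Discounted return G = Rt + gamma*Rt+1 + gamma^2*Rt+2 + ... for one
# illustrative reward sequence; Vpi(s) is the expectation of this quantity
# over trajectories that start in state s.
rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical rewards observed from a state onward
gamma = 0.9                      # assumed discount factor

G = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G)   # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```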
4. Model of the environment:
 The fourth and final element of some reinforcement learning
systems is a model of the environment.
 A model is the agent's representation of the environment.
 The model of the environment is something that defines how
the environment will behave when an action is performed in a
given state.
 For a given state and action, the model can predict the next
state and reward.
 Methods for solving reinforcement learning problems that use
models and planning are called model-based methods.
 Model-free methods are explicitly trial-and-error learners and can
be viewed as almost the opposite of planning.
 We also explore reinforcement learning systems that simultaneously
learn by trial and error and plan with a learned model.
 Modern reinforcement learning spans the range from
low-level, trial-and-error learning to high-level, deliberative
planning.
TWO fundamental problems in sequential decision making
1. Reinforcement Learning: (MODEL-FREE METHOD)
 The environment is initially unknown
 The agent interacts with the environment
 The agent improves its policy
2. Planning:
o A model of the environment is known
o The agent performs computations with its model (without
any external interaction), i.e. given a state and an action,
the model gives the reward and the next state
o The agent improves its policy
(q) Explain approaches to implement reinforcement learning.
OR
Explain value-based, policy-based, and model-based
reinforcement learning.

There are mainly three ways to implement reinforcement learning in ML,
which are:

Value-based: The value-based approach is about finding the optimal value
function, which is the maximum value of a state under any policy.
In other words, the agent expects the long-term return at any state s under
policy π.

Policy-based: The policy-based approach is to find the optimal policy for the
maximum future reward without using the value function. In this
approach, the agent tries to apply a policy such that the action performed
at each step helps to maximize the future reward. The policy-based
approach has mainly two types of policy:
 Deterministic: The same action is produced by the policy (π) at any given state.
 Stochastic: In this policy, probability determines the produced action.

Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no
single solution or algorithm for this approach because the model representation
is different for each environment.
Model-based:
 Policy and/or value function
 Model
Value-based:
 No policy
 Value function
Policy-based:
 Policy
 No value function
Actor-Critic:
 Policy
 Value function
(Q) Difference between Reinforcement Learning and Supervised
Learning

 Reinforcement Learning and Supervised Learning
are both parts of machine learning, but the two types of
learning are quite different from each other.
 RL agents interact with the environment, explore it,
take actions, and get rewarded.
 Whereas in supervised learning, algorithms learn from a
labeled dataset and, on the basis of the training, predict
the output.
 The difference table between RL and Supervised Learning
is given below.

Reinforcement Learning | Supervised Learning
RL works by interacting with the environment. | Supervised learning works on an existing labeled dataset.
The RL algorithm works the way the human brain works when making decisions. | Supervised learning works the way a human learns things under the supervision of a guide.
No labeled dataset is present. | A labeled dataset is present.
No previous training is provided to the learning agent. | Training is provided to the algorithm so that it can predict the output.
RL helps to take decisions sequentially. | In supervised learning, a decision is made when the input is given.
1.6 Summary

Reinforcement learning is a computational approach to


understanding and automating goal-directed learning and decision
making. It is distinguished from other computational approaches by
its emphasis on learning by an agent from direct interaction with its
environment, without requiring exemplary supervision or complete
models of the environment.

In our opinion, reinforcement learning is the first field to seriously


address the computational issues that arise when learning from
interaction with an environment in order to achieve long-term goals.

Reinforcement learning uses the formal framework of Markov


decision processes to define the interaction between a learning agent
and its environment in terms of states, actions, and rewards. This
framework is intended to be a simple way of representing essential
features of the artificial intelligence problem. These features include
a sense of cause and effect, a sense of uncertainty and
nondeterminism, and the existence of explicit goals.

The concepts of value and value function are key to most


of the reinforcement learning methods that we consider in this book.
We take the position that value functions are important for efficient
search in the space of policies. The use of value functions
distinguishes reinforcement learning methods from evolutionary
methods that search directly in policy space guided by evaluations of
entire policies.
Basics of probability in reinforcement learning:
Probability is an integral part of reinforcement learning, as it helps agents
make decisions, learn from interactions with the environment, and estimate
the expected outcomes of their actions. RL algorithms use probability to
optimize policies that maximize cumulative rewards over time.

Here are some basics of probability in the context of reinforcement


learning:

1. Markov Decision Processes (MDPs): Reinforcement learning problems


are often formulated as Markov Decision Processes, which consist of
states, actions, transition probabilities, rewards, and a discount
factor. Transition probabilities represent the likelihood of
transitioning from one state to another after taking a specific action.
2. State Transitions: When an agent takes an action in a specific state,
the environment transitions to a new state based on the transition
probabilities. These probabilities determine the likelihood of moving
to different states. In probabilistic terms, they define a probability
distribution over the next states.
3. Policy: A policy in reinforcement learning represents the agent's
behavior, i.e., the strategy it uses to select actions in different states.
Policies can be deterministic (e.g., always selecting the same action
in a given state) or stochastic (selecting actions based on a probability
distribution).
4. Action Selection: In stochastic policies, the agent selects actions
based on probability distributions. These distributions can be
explicitly defined or implicitly represented by value functions.
Common methods for action selection include softmax, epsilon-
greedy, and Thompson sampling, which use probabilities to balance
exploration and exploitation (a short softmax sketch follows this section).

5. Transition and Reward Distributions: The transition probabilities


define the likelihood of transitioning to different states, while the
reward distribution represents the probabilities of receiving different
rewards in different states. These distributions are often unknown to
the agent and need to be estimated through interactions with the
environment.

6. Probability Distributions and Sampling: In reinforcement learning,


agents often rely on sampling to estimate probabilities and make
decisions. They sample actions from probability distributions,
transition to new states based on transition probabilities, and
observe rewards from reward distributions. Statistical techniques
like Monte Carlo methods and temporal difference learning utilize
these samples to estimate value functions and improve decision-
making.
7. Exploration and Exploitation: Probability is fundamental to
balancing exploration (trying out different actions to learn more
about the environment) and exploitation (taking actions with high
expected returns). The agent uses probability distributions to make
decisions that account for both exploration and exploitation goals.

These are some of the basic concepts where probability comes into
play in reinforcement learning. Understanding and effectively using
probabilities is essential for agents to learn optimal policies and make
informed decisions in uncertain environments.
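
To make item 4 above concrete, here is a minimal, hedged sketch of softmax (Boltzmann) action selection; the action-value estimates and temperature are made-up illustrative numbers, not values from the text:

```python
import math
import random

# Softmax (Boltzmann) action selection over estimated action values.
Q = [0.1, 0.5, 0.3]   # hypothetical action-value estimates
tau = 0.5             # temperature: higher tau -> more exploration

prefs = [math.exp(q / tau) for q in Q]
probs = [p / sum(prefs) for p in prefs]   # probability distribution over actions

action = random.choices(range(len(Q)), weights=probs, k=1)[0]
print(probs, action)
```

Lowering the temperature tau concentrates probability on the highest-valued action (more exploitation), while raising it spreads probability across actions (more exploration).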

Basics of linear algebra :


Linear algebra is an important mathematical framework used in
various aspects of reinforcement learning. Here are some basics of
linear algebra relevant to reinforcement learning:

1. Vectors: Vectors are fundamental in representing and manipulating


quantities in reinforcement learning. In the context of state and
action spaces, vectors are used to represent the states and actions.
For example, a state vector can contain features or observations that
describe the current state of the environment.
2. Matrices: Matrices are rectangular arrays of numbers. They are
commonly used to represent transformations, such as state
transitions and policy representations. In reinforcement learning,
transition matrices can describe the probabilities of transitioning
between states given different actions.
3. Vector Operations: Several operations are performed on vectors in
reinforcement learning. These include addition, subtraction, scalar
multiplication, dot product, and element-wise operations. For
example, the dot product between two vectors can be used to
measure similarity or compute the value of a state-action pair.
4. Matrix Operations: Matrix operations are extensively used in
reinforcement learning algorithms. Some important operations
include matrix multiplication, transpose, inverse, and element-wise
operations. For example, matrix multiplication can be used to
compute the value function update in iterative algorithms like value
iteration or policy evaluation.
5. Eigenvalues and Eigenvectors: Eigenvalues and eigenvectors play a
significant role in many reinforcement learning algorithms. They are
used to analyze and characterize the behavior of linear
transformations, such as state transitions or policy updates.
Eigenvectors associated with the dominant eigenvalues can represent
important directions in the state or action space.
6. Matrix Factorization: Matrix factorization techniques, such as
Singular Value Decomposition (SVD) or Eigenvalue Decomposition,
are utilized in reinforcement learning for dimensionality reduction,
feature extraction, and policy approximation. These techniques
decompose a matrix into constituent parts that capture its essential
properties.
7. Linear Systems and Equations: Linear algebra provides tools to solve
linear systems and equations, which are prevalent in reinforcement
learning. For example, in dynamic programming, the Bellman equations
for a fixed policy can be written as a system of linear equations and
solved to find that policy's value function (see the sketch after this list).
8. Least Squares Estimation: Linear regression and least squares
estimation techniques are commonly used in reinforcement learning
to approximate value functions, policy parameters, or other function
approximations. These techniques utilize linear algebra concepts to
find the best-fit solutions to minimize the error between predictions
and observed values.
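
As a minimal sketch of point 7 above (the transition matrix, rewards, and discount factor are invented purely for illustration), policy evaluation can be done by solving the linear system v = r + γPv directly:

```python
import numpy as np

# Policy evaluation by solving the Bellman expectation equation directly.
# P[s, s'] and r[s] describe an illustrative 3-state problem under a fixed policy.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])   # state-transition probabilities under the policy
r = np.array([1.0, 0.0, 2.0])     # expected immediate reward per state
gamma = 0.9                       # discount factor

# v = r + gamma * P v  <=>  (I - gamma * P) v = r
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)
```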

Understanding linear algebra enables reinforcement learning


practitioners to effectively model and manipulate state and action
spaces, analyze algorithms, and develop efficient solutions. It forms
the basis for many advanced concepts and techniques in the field.

Definition of a stochastic multi-armed bandit


 A stochastic multi-armed bandit (MAB) is a classic problem in the field of
reinforcement learning and decision theory.
 It represents a scenario where an agent (often referred to as a bandit) is
faced with a set of "arms" or options, each with an unknown reward
distribution.
 The agent's objective is to maximize its cumulative reward over a series
of actions.
 In a stochastic MAB, the reward distributions associated with each arm
are assumed to be stochastic, meaning that they follow some underlying
probability distribution.
 The rewards obtained from each arm are random variables, and their
distribution may vary across different arms.
 At each time step, the agent selects an arm to pull, and it receives a reward
based on the distribution associated with that arm.
 The agent's goal is to learn the arm with the highest expected reward by
sequentially exploring different arms and exploiting the knowledge
gained so far.
 The challenge in the stochastic multi-armed bandit problem lies in the
exploration-exploitation trade-off.
 On one hand, the agent needs to explore different arms to gather
information about their reward distributions. On the other hand, it should
exploit the knowledge it has gained so far to maximize its expected
reward by favoring the arms with higher estimated rewards.
Algorithms: Several algorithms have been developed to address the stochastic
multi-armed bandit problem, such as UCB (Upper Confidence Bound),
Thompson Sampling, and EXP3 (Exponential-weight algorithm for
Exploration and Exploitation).

These algorithms employ various strategies to balance exploration and


exploitation based on probabilistic reasoning, enabling the agent to learn and
adapt its decision-making policy over time.
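
As a minimal, hypothetical sketch of such a stochastic bandit (the arm probabilities below are invented for illustration and would be unknown to the agent in a real problem):

```python
import random

class BernoulliBandit:
    """A stochastic multi-armed bandit whose arms pay 1 with an unknown probability."""
    def __init__(self, probs):
        self.probs = probs   # true success probability of each arm (hidden from the agent)

    def pull(self, arm):
        # Reward is drawn from the chosen arm's Bernoulli distribution.
        return 1 if random.random() < self.probs[arm] else 0

bandit = BernoulliBandit([0.2, 0.5, 0.7])   # illustrative probabilities; arm 2 is best
print(bandit.pull(0), bandit.pull(2))
```

The algorithms discussed later (UCB, KL-UCB, Thompson Sampling) all interact with an environment of this form, learning only from the rewards returned by pull().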

Applications of Stochastic multi-armed bandit:

 Stochastic multi-armed bandit problems have practical applications in


various domains, such as
 online advertising,
 clinical trials,
 recommendation systems
 and resource allocation,
 where decision-makers need to make sequential choices under
uncertainty to optimize long-term rewards.

Best example of exploration & exploitation

Exploration: Exploration is the process of going through the entire subject to learn
all the concepts (objectives) of the subject.

Exploitation: Exploitation is the way of presenting answers to the questions in the
examination to get the highest marks using the knowledge gained during the
exploration process.
k-multi-armed bandit:
 Consider the following learning problem. You are faced
repeatedly with a choice among k-different options, or actions.
After each choice you receive a numerical reward chosen from
a stationary probability distribution that depends on the action
you selected.
 Your objective is to maximize the expected total reward over
some time period, for example, over 1000 action selections, or
time steps.
 This is the original form of the k-armed bandit problem, so
named by analogy to a slot machine, or “one-armed bandit,”
except that it has k levers instead of one.
 Each action selection is like a play of one of the slot machine’s
levers, and the rewards are the payoffs for hitting the jackpot.
Through repeated action selections you are to maximize your
winnings by concentrating your actions on the best levers

 Another analogy is that of a doctor choosing between


experimental treatments for a series of seriously ill patients.
Each action is the selection of a treatment, and each reward is
the survival or well-being of the patient.

 In the k-armed bandit problem, each of the k actions has an
expected or mean reward given that the action is selected;
 let us call this the value of that action. We denote the action
selected on time step t as At, and the corresponding reward as
Rt.
 The value of an arbitrary action a, denoted q*(a), is the
expected reward given that a is selected:
q*(a) = E[Rt | At = a].

If you knew the value of each action, then it would be trivial to solve
the k-armed bandit problem: you would always select the action with
the highest value.
 We denote the estimated value of action a at time step t as
Qt(a). We would like Qt(a) to be close to q*(a).

 If you maintain estimates of the action values, then at any time
step there is at least one action whose estimated value is
greatest. We call these the greedy actions.
 When you select one of these actions, we say that you are
exploiting your current knowledge of the values of the actions.

Action-value Methods:
 The methods for estimating the values of actions and for using
the estimates to make action-selection decisions are
collectively called action-value methods.
 One natural way to estimate the value of an action is by averaging
the rewards actually received:

Qt(a) = (sum of rewards when a was taken prior to t) / (number of times a was taken prior to t)
      = Σi Ri · 1(Ai = a) / Σi 1(Ai = a),   with the sums running over i = 1, ..., t-1

 where 1(predicate) denotes the random variable that is 1 if the
predicate is true and 0 if it is not.
 The simplest action selection rule is to select one of the actions
with the highest estimated value, that is, one of the greedy
actions.
 If there is more than one greedy action, then a selection is made
among them in some arbitrary way, perhaps randomly. We write
this greedy action selection method as

At = argmaxa Qt(a)

where argmaxa denotes the action a for which the expression that
follows is maximized.
Greedy action selection always exploits current knowledge to
maximize immediate reward (a short sketch of this method follows below).
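
A minimal sketch of the sample-average action-value method with greedy selection, using the incremental form Qt(a) ← Qt(a) + (1/N(a))(R − Qt(a)); the three-arm Bernoulli bandit and its probabilities are assumed purely for illustration:

```python
import random

true_probs = [0.2, 0.5, 0.7]   # assumed arm success probabilities (unknown to the agent)
K = len(true_probs)
Q = [0.0] * K                  # Qt(a): estimated value of each action
N = [0] * K                    # N(a): number of times each action has been selected

for t in range(1000):
    # Greedy selection, breaking ties randomly among the maximizing actions.
    best = max(Q)
    a = random.choice([i for i in range(K) if Q[i] == best])
    reward = 1 if random.random() < true_probs[a] else 0
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]   # incremental sample-average update

print(Q, N)
```

Because pure greedy selection never deliberately explores, it can lock onto a suboptimal arm; the methods below (epsilon-greedy, UCB, KL-UCB, Thompson Sampling) address exactly this.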
Definition of regret:
In the context of decision-making and reinforcement learning, regret
is a concept that measures the performance of an agent or algorithm.

 Regret is defined as the difference between the maximum expected reward
that could have been obtained by always choosing the best possible action
and the total expected reward actually obtained by the agent.

 A lower regret indicates better performance, as it suggests that


the agent is making decisions closer to the optimal ones.

In multi-armed bandit problems, regret is commonly used to evaluate


the performance of algorithms. The regret (denoted as R(T)) after T
time steps is calculated as follows:

R(T) = (max_total_reward) - (total_reward_with_algorithm),

where:

 max_total_reward is the (expected) cumulative reward
obtained by always selecting the optimal action at each time step.

 total_reward_with_algorithm is the cumulative
reward achieved by the bandit algorithm over T time steps.
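
A minimal numerical sketch of this definition (the arm means, horizon, and chosen-arm sequence are invented for illustration): when the arm means are known, the expected regret is the sum of per-step gaps between the best mean and the mean of the arm actually chosen.

```python
# Empirical (pseudo-)regret for a known Bernoulli bandit and a given arm sequence.
true_means = [0.2, 0.5, 0.7]   # assumed arm means
best_mean = max(true_means)
T = 1000

chosen_arms = [1] * T          # illustrative run: the algorithm always picked arm 1

regret = sum(best_mean - true_means[a] for a in chosen_arms)
print(regret)   # 1000 * (0.7 - 0.5) = 200 for this made-up sequence
```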

In the context of the multi-armed bandit problem,


various algorithms and strategies are designed to minimize regret
and efficiently explore the action space to find the best actions.

 The goal of a bandit algorithm is to strike a balance between


exploration (trying out different actions to gather information)
and exploitation (selecting the best-known action) to achieve a
low regret and maximize cumulative rewards over time.
 Different bandit algorithms, such as epsilon-greedy, UCB
(Upper Confidence Bound), and Thompson Sampling, have been
developed with various regret bounds and performance
guarantees.
Achieving sublinear regret:
Achieving sublinear regret is a desirable property in reinforcement
learning and decision-making problems. Sublinear regret means
that as the number of actions or time steps increases, the regret
grows at a rate slower than linear, indicating that the agent's
performance improves over time.

To achieve sublinear regret, several techniques and algorithms are


commonly employed:

1. Exploration-Exploitation Trade-off: Balancing exploration and
exploitation is crucial in achieving sublinear regret. Initially, the
agent needs to explore different options to learn about the reward
distributions of different actions. As it gains knowledge, it should
gradually shift towards exploiting the best actions to maximize
cumulative rewards.
2. Upper Confidence Bound (UCB): UCB algorithms, such as UCB1 and
UCB2, incorporate uncertainty in the estimation of action values.
They use confidence intervals to balance exploration and
exploitation. By assigning higher confidence bounds to less explored
actions, UCB algorithms encourage exploration in the early stages
while converging to the optimal action in the long run.
3. Thompson Sampling: Thompson Sampling is a probabilistic
algorithm that balances exploration and exploitation by maintaining
a posterior distribution over the reward distributions of actions. It
samples a reward distribution from the posterior and selects the
action with the highest expected reward based on the samples.
Thompson Sampling has been shown to achieve sublinear regret in
certain settings.
4. Model-Based Approaches: Model-based algorithms leverage a learned
or assumed model of the environment to optimize actions and achieve
sublinear regret. By using the model to plan and simulate future
actions and rewards, these algorithms can make more informed
decisions and improve performance over time.
UCB Algorithm:
The UCB (Upper Confidence Bound) algorithm is a popular algorithm
used in the context of multi-armed bandit problems to balance
exploration and exploitation.

 It balances between exploring different actions and exploiting


the currently estimated best action based on confidence
bounds.
 It aims to achieve sublinear regret by selecting actions that
maximize the upper confidence bound on their estimated
values.

Here's an overview of how the UCB algorithm works:

1. Initialization: Initialize the estimates of action values, denoted as
Qt(a), for each action "a" to some initial value (e.g., 0),
and initially play each action (arm) once.
2. Action Selection: At each time step, select the action that maximizes
the upper confidence bound of its estimated value.
The UCB formula typically used is:
UCB(a) = Qt(a) + c * sqrt(log(t) / Nt(a))
 Qt(a): The estimated value of action "a."
 c: A parameter that controls the exploration-exploitation trade-
off. Higher values of c encourage more exploration.
 t: The total number of time steps elapsed.
 N(a): The number of times action "a" has been selected so far.
The term sqrt(log(t) / Nt(a)) represents the exploration term that
decreases as the action is chosen more often, promoting exploitation
of actions with higher estimated values.
3. Action Execution and Reward: Execute the selected action and
observe the reward associated with that action.
4. Update Action Value Estimates: Update the estimated value of the
selected action based on the observed reward:
Qt(a) <- Qt(a) + (1/N(a)) * (reward - Qt(a))
 N(a): Increment the count of times action "a" has been selected.
5. Repeat Steps 2-4: Continue selecting actions, updating action value
estimates, and observing rewards until a specified number of
iterations or time steps.

the number c > 0 controls the degree of exploration. If Nt(a) = 0, then a


is considered to be a maximizing action.
Each time a is selected the uncertainty is presumably reduced:
Nt(a) increments, and, as it appears in the denominator, the
uncertainty term decreases

 It is important to tune the exploration parameter "c"


appropriately for the problem at hand.
 Higher values of "c" encourage more exploration, while lower
values favor exploitation.
 The choice of "c" depends on the problem's characteristics and
the desired trade-off between exploration and exploitation.
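
A minimal UCB1 sketch following the steps above (the arm probabilities, horizon, and value of c are assumed for illustration):

```python
import math
import random

true_probs = [0.2, 0.5, 0.7]   # assumed arm success probabilities (unknown to the agent)
K = len(true_probs)
c = 2.0                        # exploration parameter
Q = [0.0] * K                  # estimated value of each arm
N = [0] * K                    # pull counts

def pull(a):
    return 1 if random.random() < true_probs[a] else 0

# Step 1: play each arm once.
for a in range(K):
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

# Steps 2-5: pick the arm with the largest upper confidence bound, then update.
for t in range(K + 1, 1001):
    ucb = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(K)]
    a = ucb.index(max(ucb))
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print(Q, N)   # the best arm should accumulate most of the pulls
```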
KL-UCB:
KL-UCB (Kullback-Leibler Upper Confidence Bound) is an algorithm
used in the context of multi-armed bandit problems to balance
exploration and exploitation based on the Kullback-Leibler
divergence between the estimated and true reward distributions.

Here's an overview of how the KL-UCB algorithm works:

1. Initialization: Initialize the estimates of action values, denoted as


Q(a), for each action "a" to some initial value (e.g., 0).
2. Action Selection: At each time step, select the action that maximizes
the upper confidence bound on its estimated value. For rewards in [0, 1]
(e.g., Bernoulli rewards), the KL-UCB index is:
UCB(a) = max { q in [Q(a), 1] : N(a) * d(Q(a), q) <= log(t) }
where d(p, q) = p*log(p/q) + (1-p)*log((1-p)/(1-q)) is the Bernoulli
Kullback-Leibler divergence.
 Q(a): The estimated (mean) value of action "a."
 t: The total number of time steps elapsed.
 N(a): The number of times action "a" has been selected so far.
The index is the largest mean reward that is still statistically consistent
with the rewards observed for action "a"; it is large for actions with high
uncertainty (low count N(a)) and tightens toward Q(a) as the action is
selected more often.
3. Action Execution and Reward: Execute the selected action and
observe the reward associated with that action.
4. Update Action Value Estimates: Update the estimated value of the
selected action based on the observed reward:
Q(a) <- Q(a) + (1/N(a)) * (reward - Q(a))
 N(a): Increment the count of times action "a" has been selected.
5. Repeat Steps 2-4: Continue selecting actions, updating action value
estimates, and observing rewards until a specified number of
iterations or time steps.
The KL-UCB algorithm uses the Kullback-Leibler divergence to
estimate the confidence bounds on the true reward distributions. By
choosing actions that maximize the upper confidence bound, the
algorithm balances exploration and exploitation, favoring actions
that are likely to have higher true rewards or actions with higher
uncertainty.

The confidence bound tightens as the action is chosen more often and
more rewards are observed, encouraging the algorithm to shift towards
exploitation as more information is gathered.

The KL-UCB algorithm has been shown to achieve sublinear regret


bounds under certain assumptions about the reward distributions.
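
A minimal KL-UCB sketch for Bernoulli rewards, with the index computed by bisection (the arm probabilities and horizon are assumed for illustration; a fuller implementation would also handle the log log t refinement and non-Bernoulli rewards):

```python
import math
import random

true_probs = [0.2, 0.5, 0.7]   # assumed arm success probabilities (unknown to the agent)
K = len(true_probs)
Q = [0.0] * K                  # empirical mean reward of each arm
N = [0] * K                    # pull counts

def kl(p, q, eps=1e-12):
    # Bernoulli Kullback-Leibler divergence d(p, q).
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n, t, precision=1e-6):
    # Largest q in [mean, 1] with n * d(mean, q) <= log(t), found by bisection.
    bound = math.log(max(t, 2)) / n
    lo, hi = mean, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if kl(mean, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def pull(a):
    return 1 if random.random() < true_probs[a] else 0

for a in range(K):   # play each arm once
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

for t in range(K + 1, 1001):
    indices = [kl_ucb_index(Q[a], N[a], t) for a in range(K)]
    a = indices.index(max(indices))
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print(Q, N)
```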
Thompson Sampling:
 Thompson Sampling is a popular algorithm used in the context of
multi-armed bandit problems.

 Thompson Sampling balances exploration and exploitation by
maintaining a Beta posterior over each action's mean reward and
sampling from it to decide which action to take.

Thompson Sampling algorithm

1. Initialization: Set prior distributions for the unknown reward
distributions associated with each action.
For each arm i = 1, 2, ..., K set Si = 0 (successes) and Fi = 0 (failures).
2. Action Selection: Draw a sample from each action's corresponding
posterior distribution:
Draw a sample Qi(t) from Beta(Si + 1, Fi + 1) for each arm i.
3. Action Execution and Reward: Execute the selected action and observe
the reward associated with it:
At = argmaxi Qi(t)
Observe the reward rt.
4. Update Posterior Distributions: Update the posterior distribution of the
selected action based on the observed reward. This involves updating the
parameters of its Beta distribution:
If rt = 1 then SAt = SAt + 1
else FAt = FAt + 1.
5. Repeat Steps 2-4: Continue sampling from the posteriors, executing
actions, updating the success and failure counts, and observing
rewards until a specified number of iterations or time steps.
6. Action Selection for Exploitation: As the posterior distributions are
updated with more observations, the samples concentrate around each
arm's true mean reward, and the algorithm naturally shifts from
exploration to selecting the action with the highest expected reward.
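
A minimal Beta-Bernoulli Thompson Sampling sketch following the steps above (the arm probabilities and horizon are assumed for illustration):

```python
import random

true_probs = [0.2, 0.5, 0.7]   # assumed arm success probabilities (unknown to the agent)
K = len(true_probs)
S = [0] * K                    # Si: success count of arm i
F = [0] * K                    # Fi: failure count of arm i

for t in range(1000):
    # Step 2: draw one sample Qi(t) from each arm's Beta(Si + 1, Fi + 1) posterior.
    samples = [random.betavariate(S[i] + 1, F[i] + 1) for i in range(K)]
    a = samples.index(max(samples))                  # Step 3: At = argmax_i Qi(t)
    rt = 1 if random.random() < true_probs[a] else 0
    if rt == 1:                                      # Step 4: update the Beta parameters
        S[a] += 1
    else:
        F[a] += 1

print(S, F)   # the best arm should end up with the most pulls and successes
```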
