RL Unit 1 - QA
Unit-1
Model Question & Answer
(Aimed to Cover 10 Questions of QB)
Sabyasachi Chakraborty
Mathematical Framework
The PAC learning framework provides the following inequality for the minimum number of
training samples (m):
m ≥ (1/ϵ) ( ln∣H∣ + ln(1/δ) )
Explanation of Terms:
∣H∣: The size of the hypothesis space (or its complexity, such as the Vapnik-
Chervonenkis (VC) dimension).
ϵ : The error tolerance.
δ : The probability of failure.
This formula illustrates how the complexity of H, the desired accuracy (ϵ), and the
confidence (1−δ) influence the amount of data needed for learning.
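For concreteness, the bound can be evaluated numerically. The short Python sketch below uses assumed example values for ∣H∣, ϵ, and δ (all illustrative) and computes the resulting minimum m.

```python
import math

def pac_sample_bound(hypothesis_space_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((1.0 / epsilon) *
                     (math.log(hypothesis_space_size) + math.log(1.0 / delta)))

# Assumed example values: |H| = 1000 hypotheses, 5% error tolerance, 95% confidence.
print(pac_sample_bound(hypothesis_space_size=1000, epsilon=0.05, delta=0.05))  # -> 199
```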
Solution:
1. Arms:
Each arm corresponds to an action, and each action (arm) has an associated but unknown
probability distribution of rewards. In the simplest case, each arm has a fixed expected
reward whose value is unknown to the learner.
2. Action Selection:
The decision-maker must decide which arm to pull at each step. The challenge is to
balance two competing objectives:
o Exploitation: Choose the arm with the highest known reward.
o Exploration: Choose arms that have been tried less frequently in order to
learn more about their reward distribution.
3. Reward Distribution:
Each arm i has a reward distribution with mean μ_i, and the goal is to maximize the
expected cumulative reward by selecting the best arm. The true mean of each arm is
unknown and must be estimated over time.
4. Goal:
Maximize the cumulative reward by selecting arms based on observed rewards, while
balancing exploration (to estimate the true mean rewards) and exploitation (to choose
the best arm based on current knowledge).
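A minimal Python sketch of this setup is given below, assuming each arm's rewards are Gaussian with a fixed but hidden mean (class and variable names are illustrative):

```python
import random

class GaussianBandit:
    """K arms; each arm i has a fixed but unknown mean reward mu_i."""

    def __init__(self, true_means, reward_std=1.0):
        self.true_means = true_means        # hidden from the learner
        self.reward_std = reward_std

    def pull(self, arm):
        """Pulling an arm returns a noisy sample from that arm's distribution."""
        return random.gauss(self.true_means[arm], self.reward_std)

# Example: 3 arms. The learner only observes sampled rewards, never the true means.
bandit = GaussianBandit(true_means=[0.2, 0.5, 0.8])
reward = bandit.pull(arm=1)
```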
The Upper Confidence Bound (UCB) algorithm is a popular method for solving the multi-
armed bandit problem. UCB balances exploration and exploitation by selecting arms based
on an upper confidence bound that considers both the estimated mean reward and the
uncertainty about that estimate.
The UCB algorithm selects the arm with the highest upper confidence bound. This upper
bound is calculated using both the estimated mean reward and its variability. The algorithm
encourages exploration for arms with higher uncertainty and exploitation for arms with
higher estimated rewards.
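One standard concrete form is the UCB1 index, which adds to each arm's estimated mean an exploration bonus that shrinks as the arm is pulled more often. A minimal sketch is shown below (function and variable names are illustrative):

```python
import math

def ucb1_select(counts, sums, t):
    """Return the arm maximizing: sample mean + sqrt(2 * ln(t) / n_i)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm                      # pull every arm once before using the bound
    indices = [
        sums[arm] / counts[arm] + math.sqrt(2.0 * math.log(t) / counts[arm])
        for arm in range(len(counts))
    ]
    return indices.index(max(indices))

# Usage in a loop: after pulling `arm`, update counts[arm] += 1 and sums[arm] += reward.
```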
Sliding Window UCB: This version only considers the most recent rewards, making
it more suitable for environments where the reward distribution changes over time.
UCB1-Tuned: A variant that adapts the exploration term based on the variance of the
rewards.
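For reference, one common way to write the UCB1-Tuned index replaces the constant 2 in the exploration term with an empirical-variance bound. The sketch below follows the form usually attributed to Auer et al.; exact details may vary across sources.

```python
import math

def ucb1_tuned_index(mean, mean_of_squares, n_i, t):
    """Index for one arm: mean + sqrt((ln t / n_i) * min(1/4, V_i)),
    where V_i is the sample variance plus an exploration correction."""
    variance_bound = mean_of_squares - mean ** 2 + math.sqrt(2.0 * math.log(t) / n_i)
    return mean + math.sqrt((math.log(t) / n_i) * min(0.25, variance_bound))
```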
Conclusion
The UCB Algorithm is a powerful and widely used approach to the multi-armed bandit
problem. Its simplicity and strong theoretical guarantees make it a preferred choice in many
applications that require a trade-off between exploration and exploitation. By using
confidence bounds, UCB quickly exploits the best-performing arms while still exploring
less-tried arms to refine its estimates and improve overall performance.
Bandit algorithms are a class of reinforcement learning algorithms used to solve the multi-
armed bandit problem. These algorithms are designed to balance the exploration of new
options (arms) with the exploitation of known options that provide high rewards. The goal is
to maximize the cumulative reward over time by selecting actions that offer the most benefit
while still learning about less-explored options.
The multi-armed bandit problem is typically framed as a scenario where there are multiple
actions (arms), each associated with an unknown probability distribution of rewards. The
decision maker must choose which action to take in each round to maximize the cumulative
reward. The challenge is to explore the arms (try them out) to estimate their rewards while
also exploiting the best-performing ones to maximize the accumulated reward.
1. Actions (Arms): The set of possible actions or arms to choose from. Each arm has an
associated reward distribution.
2. Exploration: Trying out different arms to gather information about their expected
rewards.
3. Exploitation: Choosing the arm with the highest observed reward based on past trials.
4. Regret: The difference between the reward obtained by the chosen actions and the
reward that would have been obtained by always choosing the best arm.
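In simulation, where the true means are known, cumulative regret can be computed directly; a minimal sketch with assumed example values:

```python
def cumulative_regret(true_means, chosen_arms):
    """Regret = sum over rounds of (best arm's mean - chosen arm's mean)."""
    best_mean = max(true_means)
    return sum(best_mean - true_means[arm] for arm in chosen_arms)

# Assumed example: best arm has mean 0.75; choosing arm 0 twice costs 2 * 0.5.
print(cumulative_regret([0.25, 0.5, 0.75], chosen_arms=[0, 0, 2, 2]))  # -> 1.0
```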
UCB focuses on arms that are uncertain (based on the size of the confidence interval), thus
encouraging exploration, while still exploiting the arms with high rewards.
3. Thompson Sampling
Exploration vs. Exploitation: Thompson Sampling is a Bayesian approach that
probabilistically selects the arm to pull based on the posterior distributions of the
rewards. It explores arms based on their potential to perform well, using the
distribution of past rewards.
Mathematical Explanation: At each time step:
o Sample from the posterior distribution of each arm.
o Select the arm with the highest sampled value.
Thompson Sampling has been shown to perform well empirically and is often used in practice
because it naturally balances exploration and exploitation.
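A minimal sketch for the common Bernoulli-reward case, where each arm's posterior is a Beta distribution (names are illustrative):

```python
import random

def thompson_select(successes, failures):
    """Draw one sample from each arm's Beta posterior and pick the largest."""
    samples = [
        random.betavariate(successes[arm] + 1, failures[arm] + 1)   # Beta(1, 1) prior
        for arm in range(len(successes))
    ]
    return samples.index(max(samples))

# After pulling `arm` with reward in {0, 1}:
#   successes[arm] += reward; failures[arm] += 1 - reward
```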
4. Softmax Algorithm
Exploration vs. Exploitation: In the Softmax approach, each arm is selected with a
probability that increases with its estimated reward, so the best-looking arms are chosen
most often while lower-valued arms still have some chance of being tried. This injects
randomness into the decision-making process, making it a more exploratory strategy.
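A minimal sketch of softmax (Boltzmann) selection, where a temperature parameter controls how much randomness is injected (names and the temperature value are illustrative):

```python
import math
import random

def softmax_select(estimated_rewards, temperature=0.1):
    """Select an arm with probability proportional to exp(Q_i / temperature)."""
    max_q = max(estimated_rewards)                 # subtract max for numerical stability
    weights = [math.exp((q - max_q) / temperature) for q in estimated_rewards]
    total = sum(weights)
    return random.choices(range(len(estimated_rewards)),
                          weights=[w / total for w in weights], k=1)[0]
```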
3. A/B Testing
A/B testing is a standard method in marketing and product development where two or more
versions of a webpage or app feature are tested with users to determine which one
performs best. Bandit algorithms are used to adaptively allocate traffic to the better-
performing variations over time.
Use case: In a website redesign, two versions of the homepage are tested to see
which performs better in terms of conversions. Bandit algorithms can automatically
allocate more traffic to the version with a higher conversion rate, while continuing to
test the other version to learn more.
4. Robotics and Autonomous Systems
In robotics, bandit algorithms are used to control exploration and exploitation in tasks such
as path planning, robotic manipulation, or optimizing control parameters. Bandit algorithms
help the robot decide which actions to take based on uncertain information about the
environment.
Use case: A robot exploring an environment uses bandit algorithms to decide which
direction to move in next, balancing the need to explore new areas while also
exploiting areas it has already identified as productive (e.g., with more objects of
interest).
5. Healthcare and Drug Trials
Bandit algorithms are increasingly used in clinical trials to optimize the allocation of patients
to different treatment arms. By selecting the most promising treatments based on observed
outcomes, the algorithm can improve patient outcomes while reducing the number of
patients who receive less effective treatments.
Use case: In a clinical trial for a new drug, patients are dynamically assigned to either
the experimental treatment or a control treatment. Bandit algorithms shift the allocation
toward whichever treatment appears more effective, so more patients receive the better
option as evidence accumulates.