Expanded Multi-Armed Bandit and Probability Basics
1. Basics of Probability and Linear Algebra
1. Random Variables:
- A random variable assigns numerical values to the outcomes of a random process. There
are two main types of random variables:
• Discrete Random Variables: These take on a countable number of distinct values. For
example, the outcome of rolling a die (1, 2, 3, 4, 5, 6) is a discrete random variable.
• Continuous Random Variables: These take on values from a continuous range, so there are
uncountably many possible values. An example is the height of individuals in a population.
2. Probability Distributions:
- Discrete Probability Distributions are described using a Probability Mass Function
(PMF), which assigns a probability to each possible outcome. Common discrete
distributions include:
• Binomial Distribution: Models the number of successes in a fixed number of
independent Bernoulli trials.
• Poisson Distribution: Models the number of events occurring within a fixed interval of
time or space.
- Continuous Probability Distributions are described using a Probability Density Function
(PDF). Unlike PMFs, PDFs do not give probabilities directly but rather describe the relative
likelihood of outcomes within an interval. Common continuous distributions include:
• Normal Distribution: Characterized by its bell-shaped curve, often used in natural and
social sciences.
• Exponential Distribution: Models the time between events in a Poisson process.
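As a quick illustration, the following Python sketch (using NumPy, with arbitrarily chosen parameters) draws samples from each of these distributions and checks the empirical means against the theoretical ones.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of samples per distribution

# Binomial: number of successes in 10 Bernoulli trials with p = 0.3
binom = rng.binomial(n=10, p=0.3, size=n)
print("Binomial mean:", binom.mean(), "(theory: 3.0)")

# Poisson: event counts with rate lambda = 4 per interval
poisson = rng.poisson(lam=4.0, size=n)
print("Poisson mean:", poisson.mean(), "(theory: 4.0)")

# Normal: bell-shaped curve with mean 0 and standard deviation 1
normal = rng.normal(loc=0.0, scale=1.0, size=n)
print("Normal mean:", normal.mean(), "(theory: 0.0)")

# Exponential: waiting time between events of a Poisson process with rate 4
expo = rng.exponential(scale=1 / 4.0, size=n)
print("Exponential mean:", expo.mean(), "(theory: 0.25)")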
4. Bayes' Theorem:
- Bayes' theorem provides a way to update the probability of a hypothesis based on new
evidence. It is expressed as:
P(A|B) = (P(B|A) * P(A)) / P(B)
Here, P(A|B) is the posterior probability of A given B, P(B|A) is the likelihood, P(A) is the
prior probability, and P(B) is the marginal likelihood. Bayes' theorem is foundational in
many areas of machine learning and statistical inference.
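To make the formula concrete, here is a small Python sketch of a classic diagnostic-test calculation; the prior, likelihood, and false-positive rate are made-up numbers used purely for illustration.

# Bayes' theorem with illustrative (made-up) numbers.
# A = "has the condition", B = "test comes back positive"
p_A = 0.01             # prior P(A): 1% of the population has the condition
p_B_given_A = 0.95     # likelihood P(B|A): the test detects the condition 95% of the time
p_B_given_notA = 0.05  # false-positive rate P(B|not A)

# Marginal likelihood P(B) via the law of total probability
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Posterior P(A|B) from Bayes' theorem
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(A|B) = {p_A_given_B:.3f}")  # roughly 0.161 with these numbers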
2. Matrix Operations:
- Addition: Two matrices of the same dimensions can be added together by adding their
corresponding elements.
- Multiplication: Vectors can be combined using the dot product, while matrices are
multiplied using matrix multiplication, in which each entry of the product is the dot product
of a row of the first matrix with a column of the second.
- Inverse & Transpose: The inverse of a square matrix A, denoted A⁻¹, satisfies AA⁻¹ = A⁻¹A
= I, where I is the identity matrix (not every square matrix has an inverse). The transpose of
a matrix swaps its rows with its columns.
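The following Python sketch (using NumPy, with arbitrary example matrices) runs through each of these operations.

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Addition: element-wise sum of two matrices with the same dimensions
print(A + B)

# Dot product of two vectors
v = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])
print(v @ w)  # 1*3 + 2*4 = 11

# Matrix multiplication
print(A @ B)

# Inverse and transpose: A @ inv(A) recovers the identity matrix
print(A @ np.linalg.inv(A))
print(A.T)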
2. The Multi-Armed Bandit Problem
Consider a gambler in front of several slot machines (each representing an arm). Each
machine has a different probability of paying out a reward, but these probabilities are
unknown to the gambler. The challenge is to decide which machine to play at each time step
so as to maximize the total winnings.
Formally, at each time step t, the agent selects an arm a from a set of K arms and receives a
reward drawn from the corresponding distribution. Over time, the agent must learn which
arms are more rewarding while still exploring enough to ensure that no potentially better
arms are overlooked.
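A minimal simulation of this setting, assuming Bernoulli (0/1) rewards and hand-picked payout probabilities that the agent is not allowed to inspect, might look like the following Python sketch.

import numpy as np

rng = np.random.default_rng(0)

# Hidden payout probabilities of K = 3 slot machines (unknown to the agent)
true_means = np.array([0.2, 0.5, 0.7])
K = len(true_means)

def pull(arm):
    """Play one machine and return a Bernoulli reward (1 with that machine's payout probability)."""
    return float(rng.random() < true_means[arm])

# A naive agent that picks a random arm at every time step
T = 1000
total_reward = sum(pull(rng.integers(K)) for _ in range(T))
print("Total reward of random play over", T, "steps:", total_reward)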
3. Definition of Regret
Regret is a key concept in the analysis of multi-armed bandit algorithms. It quantifies the
difference between the reward obtained by the algorithm and the reward that would have
been obtained by always selecting the best possible arm. If μ* denotes the highest expected
reward among the arms and aₜ the arm chosen at time step t, the cumulative regret after T
steps is:
Regret(T) = T * μ* − Σ μ(aₜ), with the sum running over t = 1, ..., T
The goal of a good bandit algorithm is to minimize regret over time. Ideally, the regret
should grow sublinearly with T, meaning that the average regret per time step decreases
as the algorithm learns more about the arms.
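Given a record of which arm was played at each step, the cumulative regret can be computed directly from this definition; the Python sketch below assumes hypothetical Bernoulli arms and a short, made-up sequence of choices.

import numpy as np

true_means = np.array([0.2, 0.5, 0.7])  # hypothetical expected rewards per arm
best_mean = true_means.max()            # mu*

# Suppose the algorithm played these arms over T = 6 steps
chosen_arms = [0, 1, 2, 2, 1, 2]

# Regret(T) = T * mu* - sum over t of mu(a_t)
regret = len(chosen_arms) * best_mean - true_means[chosen_arms].sum()
print("Cumulative expected regret:", regret)  # 6 * 0.7 - 3.3 = 0.9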
4. Achieving Sublinear Regret
Sublinear regret is a desirable property in multi-armed bandit algorithms, indicating that
the algorithm's performance approaches that of the optimal strategy over time. Achieving
sublinear regret requires a careful balance between exploration and exploitation.
In simple terms:
- Linear Regret (O(T)): If an algorithm chooses arms randomly without learning from past
experiences, the regret will grow linearly with time.
- Sublinear Regret (O(log T) or O(√T)): Efficient algorithms that learn from past experiences
can achieve sublinear regret, meaning the average regret per time step decreases over time.
Algorithms like UCB, KL-UCB, and Thompson Sampling are designed to achieve sublinear
regret by dynamically adjusting the balance between exploring new arms and exploiting
known high-reward arms.
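The difference is easiest to see in the average regret per step, Regret(T)/T: under linear growth it stays constant, while under O(log T) or O(√T) growth it shrinks toward zero. The constants in this Python sketch are arbitrary and only illustrate the trend.

import numpy as np

# Average regret per step (Regret(T) / T) for different growth rates
for T in (10**2, 10**4, 10**6):
    linear = 0.5 * T              # O(T) growth, e.g., random play with a constant per-step gap
    logarithmic = 10 * np.log(T)  # O(log T) growth, e.g., UCB-style algorithms
    square_root = 2 * np.sqrt(T)  # O(sqrt(T)) growth
    print(f"T={T:>8}: linear {linear / T:.3f}, log {logarithmic / T:.5f}, sqrt {square_root / T:.5f}")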
5. Upper Confidence Bound (UCB)
The UCB1 algorithm selects, at each time step t, the arm a that maximizes Q̂ₐ + √(2 * ln(t) / Nₐ),
where Q̂ₐ is the empirical mean reward of arm a and Nₐ is the number of times arm a has been
played so far. The term √(2 * ln(t) / Nₐ) represents the exploration bonus, which decreases
as an arm is selected more frequently. This encourages the algorithm to explore less
frequently chosen arms while gradually focusing on the most rewarding ones.
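A compact Python sketch of this rule (UCB1 on hypothetical Bernoulli arms, with each arm played once to initialize its estimate) could look like the following; the payout probabilities are assumptions used only to drive the simulation.

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])  # hidden, used only to simulate rewards
K, T = len(true_means), 10_000

counts = np.zeros(K)  # N_a: how often each arm has been played
sums = np.zeros(K)    # running sum of rewards per arm

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1  # play every arm once to initialize the estimates
    else:
        means = sums / counts                     # empirical mean Q_a
        bonus = np.sqrt(2 * np.log(t) / counts)   # exploration bonus
        arm = int(np.argmax(means + bonus))
    reward = float(rng.random() < true_means[arm])  # Bernoulli reward
    counts[arm] += 1
    sums[arm] += reward

print("Pulls per arm:", counts)
print("Cumulative expected regret:", T * true_means.max() - (true_means * counts).sum())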
6. KL-UCB
KL-UCB is a variant of UCB that derives each arm's upper confidence bound from the
Kullback-Leibler (KL) divergence between the arm's empirical reward distribution and
candidate means, rather than from a fixed square-root bonus. It often outperforms the
standard UCB algorithm, especially when the reward distributions are not Gaussian; the
KL-based bound yields tighter confidence intervals and more efficient exploration.
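For Bernoulli rewards, the KL-UCB index of an arm with empirical mean p̂ and Nₐ pulls is (in a simplified form, without the ln(ln t) correction term) the largest q ≥ p̂ such that Nₐ * KL(p̂, q) ≤ ln(t). The Python sketch below computes that index by binary search; the example numbers are arbitrary.

import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n_pulls, t):
    """Largest q in [p_hat, 1] with n_pulls * KL(p_hat, q) <= ln(t), found by binary search."""
    target = math.log(t)
    lo, hi = p_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if n_pulls * bernoulli_kl(p_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

# Example: an arm with empirical mean 0.4 after 20 pulls, at time step t = 100
print(kl_ucb_index(0.4, 20, 100))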
7. Thompson Sampling
Thompson Sampling is a Bayesian approach to the multi-armed bandit problem that
balances exploration and exploitation through probabilistic inference. It maintains a
probability distribution over the expected rewards of each arm and selects arms based on
samples from these distributions.
Thompson Sampling has been shown to perform well in practice, often matching or
exceeding the performance of more complex algorithms. Its probabilistic nature allows it to
adapt effectively to changing environments and non-stationary reward distributions.
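For Bernoulli rewards, a common instantiation keeps a Beta(successes + 1, failures + 1) posterior for each arm; the Python sketch below (hypothetical arm means, uniform Beta(1, 1) priors) shows the sample-then-act loop.

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])  # hidden, used only to simulate rewards
K, T = len(true_means), 10_000

successes = np.ones(K)  # Beta prior parameter alpha = 1 for every arm
failures = np.ones(K)   # Beta prior parameter beta = 1 for every arm

for _ in range(T):
    # Sample a plausible mean for each arm from its posterior, then act greedily on the samples
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))
    reward = float(rng.random() < true_means[arm])  # Bernoulli reward
    successes[arm] += reward
    failures[arm] += 1 - reward

posterior_means = successes / (successes + failures)
print("Posterior mean estimates:", np.round(posterior_means, 3))
print("Plays per arm:", (successes + failures - 2).astype(int))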