
Multi-Armed Bandit Algorithms and Probability Basics
1. Basics of Probability and Linear Algebra

1.1 Probability Basics


Probability theory is a branch of mathematics that deals with the analysis of random
phenomena. The outcomes of a random experiment are described by random variables, and
the likelihood of these outcomes is quantified using probability distributions.

1. Random Variables:
- A random variable assigns numerical values to the outcomes of a random process. There
are two main types of random variables:
• Discrete Random Variables: These take on a countable number of distinct values. For
example, the outcome of rolling a die (1, 2, 3, 4, 5, 6) is a discrete random variable.
• Continuous Random Variables: These take on uncountably many possible values, such as
any value within an interval. An example is the height of individuals in a population.

2. Probability Distributions:
- Discrete Probability Distributions are described using a Probability Mass Function
(PMF), which assigns probabilities to each possible outcome. Common discrete
distributions include:
• Binomial Distribution: Models the number of successes in a fixed number of
independent Bernoulli trials.
• Poisson Distribution: Models the number of events occurring within a fixed interval of
time or space.
- Continuous Probability Distributions are described using a Probability Density Function
(PDF). Unlike PMFs, PDFs do not give probabilities directly but rather describe the relative
likelihood of outcomes within an interval. Common continuous distributions include:
• Normal Distribution: Characterized by its bell-shaped curve, often used in natural and
social sciences.
• Exponential Distribution: Models the time between events in a Poisson process.
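
As an illustrative sketch, the SciPy snippet below evaluates a binomial and a Poisson PMF and a normal PDF and exponential CDF; the specific parameter values are arbitrary examples, not taken from the text.

# Illustrative sketch using SciPy's standard distribution objects.
from scipy import stats

# Discrete: PMF of a Binomial(n=10, p=0.3) at k = 3 successes.
print(stats.binom.pmf(3, n=10, p=0.3))          # P(X = 3)

# Discrete: PMF of a Poisson with rate 2 at k = 1 event.
print(stats.poisson.pmf(1, mu=2))               # P(X = 1)

# Continuous: the normal PDF gives relative likelihood, not a probability.
print(stats.norm.pdf(0.0, loc=0.0, scale=1.0))  # density at x = 0

# Probabilities for continuous variables come from integrating the PDF,
# i.e. from the CDF: P(X <= 1) for an Exponential with rate 1.
print(stats.expon.cdf(1.0, scale=1.0))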

3. Expectation and Variance:


- The expectation (or mean) of a random variable provides a measure of its central
tendency. For a discrete random variable X, the expectation is given by:
E[X] = Σₓ x · P(X = x)
For continuous random variables, it is expressed as:
E[X] = ∫ x · f(x) dx
- Variance measures the spread of the random variable around its mean, defined as:
Var(X) = E[(X - E[X])²]
A higher variance indicates that the data points are more spread out from the mean.
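
A minimal sketch of these two formulas for a fair six-sided die (the probabilities 1/6 are the standard assumption for a fair die):

# Expectation and variance of a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# E[X] = sum over x of x * P(X = x)
mean = sum(x * p for x, p in zip(values, probs))

# Var(X) = E[(X - E[X])^2]
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

print(mean)      # 3.5
print(variance)  # ~2.9167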

4. Bayes' Theorem:
- Bayes' theorem provides a way to update the probability of a hypothesis based on new
evidence. It is expressed as:
P(A|B) = (P(B|A) * P(A)) / P(B)
Here, P(A|B) is the posterior probability of A given B, P(B|A) is the likelihood, P(A) is the
prior probability, and P(B) is the marginal likelihood. Bayes' theorem is foundational in
many areas of machine learning and statistical inference.
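
A worked sketch of the update, using made-up numbers for a diagnostic-test scenario (99% sensitivity, 5% false-positive rate, 1% prevalence; these figures are illustrative assumptions):

# Bayes' theorem with hypothetical numbers: A = "has the condition",
# B = "test is positive".
p_a = 0.01              # prior P(A): prevalence of the condition
p_b_given_a = 0.99      # likelihood P(B|A): positive test given the condition
p_b_given_not_a = 0.05  # false-positive rate P(B|not A)

# Marginal likelihood P(B) by the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
posterior = p_b_given_a * p_a / p_b
print(posterior)  # ~0.167: a positive test alone is far from conclusive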

1.2 Linear Algebra Basics


Linear algebra is the study of vectors, vector spaces, and linear transformations between
them. It provides the mathematical framework for understanding and manipulating data in
multiple dimensions, which is essential in fields like machine learning, computer graphics,
and physics.

1. Vectors and Matrices:


- A vector is an ordered list of numbers that can represent a point in space, a direction, or
any other multidimensional data. For example, a 3-dimensional vector is written as x = (x₁,
x₂, x₃).
- A matrix is a two-dimensional array of numbers arranged in rows and columns. Matrices
are used to represent linear transformations and systems of linear equations. For example,
a 2x2 matrix looks like:
[a b]
[c d]

2. Matrix Operations:
- Addition: Two matrices of the same dimensions can be added together by adding their
corresponding elements.
- Multiplication: Vectors can be multiplied using the dot product, while matrices are
multiplied using matrix multiplication rules.
- Inverse & Transpose: The inverse of a matrix A, denoted A⁻¹, satisfies the equation AA⁻¹
= I, where I is the identity matrix. The transpose of a matrix swaps its rows with its columns.
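
The NumPy snippet below illustrates these operations on small example matrices; the values are arbitrary.

# Addition, dot product, matrix multiplication, transpose, and inverse.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])
v = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])

print(A + B)          # element-wise matrix addition
print(v @ w)          # dot product of two vectors: 11.0
print(A @ B)          # matrix multiplication
print(A.T)            # transpose: rows become columns
A_inv = np.linalg.inv(A)
print(A @ A_inv)      # approximately the 2x2 identity matrix I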

3. Eigenvalues and Eigenvectors:


- Eigenvalues and eigenvectors are fundamental in understanding linear transformations.
For a square matrix A, if there exists a scalar λ and a non-zero vector v such that:
Av = λv
then λ is an eigenvalue of A, and v is the corresponding eigenvector.
- These concepts are crucial in many applications, including Principal Component Analysis
(PCA), which is used for dimensionality reduction in machine learning.
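
A minimal NumPy sketch that computes an eigendecomposition and checks Av = λv for one eigenpair; the matrix is an arbitrary symmetric example.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)  # columns of eigvecs are eigenvectors
print(eigvals)                       # eigenvalues 3 and 1 (order may vary)

# Verify A v = lambda v for the first eigenpair.
lam, v = eigvals[0], eigvecs[:, 0]
print(np.allclose(A @ v, lam * v))   # True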

2. Stochastic Multi-Armed Bandit


The stochastic multi-armed bandit (MAB) problem is a classic example of the exploration-
exploitation dilemma. In this problem, an agent is faced with multiple options (arms), each
providing a reward drawn from an unknown probability distribution. The agent's goal is to
maximize the total reward over time by carefully balancing exploration (trying out different
arms to gather information) and exploitation (choosing the arm that currently seems to
offer the highest reward).

Consider a gambler in front of several slot machines (each representing an arm). Each
machine has a different probability of paying out a reward, but these probabilities are
unknown to the gambler. The challenge is to decide which machine to play at each time step
to maximize the total winnings.

Formally, at each time step t, the agent selects an arm a from a set of K arms and receives a
reward drawn from the corresponding distribution. Over time, the agent must learn which
arms are more rewarding while still exploring enough to ensure that no potentially better
arms are overlooked.
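
The small simulation below sketches this setting under a Bernoulli-reward assumption; the class name and arm probabilities are illustrative choices, not part of any standard library.

import random


class BernoulliBandit:
    """K arms; arm a pays reward 1 with unknown probability means[a], else 0."""

    def __init__(self, means):
        self.means = means          # true (hidden) success probabilities
        self.k = len(means)

    def pull(self, arm):
        # A fresh reward is drawn from the arm's distribution at each step.
        return 1 if random.random() < self.means[arm] else 0


bandit = BernoulliBandit([0.2, 0.5, 0.7])   # arm 2 is the optimal arm
print(bandit.pull(1))                        # 0 or 1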

3. Definition of Regret
Regret is a key concept in the analysis of multi-armed bandit algorithms. It quantifies the
difference between the reward obtained by the algorithm and the reward that would have
been obtained by always selecting the best possible arm.

Mathematically, the regret after T time steps is defined as:


Regret(T) = T * μ* - Σₜ μₐₜ,   summed over t = 1, ..., T
where:
- T is the total number of time steps.
- μ* is the expected reward of the optimal arm.
- μₐₜ is the expected reward of the arm aₜ selected at time t.

The goal of a good bandit algorithm is to minimize regret over time. Ideally, the regret
should grow sublinearly with time, meaning that the average regret per time step decreases
as the algorithm learns more about the arms.
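
As a rough sketch, the snippet below (reusing the arm means from the earlier bandit example) shows that a policy which picks arms uniformly at random, without learning, accumulates regret that grows linearly in T.

import random

means = [0.2, 0.5, 0.7]
mu_star = max(means)                 # expected reward of the optimal arm
T = 10_000

expected_reward_collected = 0.0
for t in range(T):
    arm = random.randrange(len(means))        # no learning: uniform choice
    expected_reward_collected += means[arm]   # mu_{a_t} for the chosen arm

regret = T * mu_star - expected_reward_collected
print(regret)   # roughly T * (0.7 - 0.467) ≈ 0.233 * T, i.e. linear in T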
4. Achieving Sublinear Regret
Sublinear regret is a desirable property in multi-armed bandit algorithms, indicating that
the algorithm's performance approaches that of the optimal strategy over time. Achieving
sublinear regret requires a careful balance between exploration and exploitation.

In simple terms:
- Linear Regret (O(T)): If an algorithm chooses arms randomly without learning from past
experiences, the regret will grow linearly with time.
- Sublinear Regret (O(log T) or O(√T)): Efficient algorithms that learn from past experiences
can achieve sublinear regret, meaning the average regret per time step decreases over time.

Algorithms like UCB, KL-UCB, and Thompson Sampling are designed to achieve sublinear
regret by dynamically adjusting the balance between exploring new arms and exploiting
known high-reward arms.

5. Upper Confidence Bound (UCB) Algorithm


The Upper Confidence Bound (UCB) algorithm is a popular method for solving the multi-
armed bandit problem. It operates on the principle of optimism in the face of uncertainty,
meaning it selects the arm with the highest potential reward based on current information
and a confidence interval.

The UCB1 algorithm selects arms according to the following rule:


Aₜ = argmaxₐ ( μ̂ₐ + √(2 * ln(t) / Nₐ) )
where:
- μ̂ₐ is the estimated mean reward of arm a.
- Nₐ is the number of times arm a has been selected.
- t is the current time step.

The term √(2 * ln(t) / Nₐ) represents the exploration bonus, which decreases as an arm is
selected more frequently. This encourages the algorithm to explore less frequently chosen
arms while gradually focusing on the most rewarding ones.
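
A compact sketch of UCB1 under the Bernoulli-reward assumption used in the earlier examples; the horizon and arm means are arbitrary.

import math
import random


def ucb1(means, horizon):
    k = len(means)
    counts = [0] * k          # N_a: times each arm has been pulled
    sums = [0.0] * k          # running sum of rewards per arm

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # pull each arm once to initialize estimates
        else:
            # Index: estimated mean plus exploration bonus sqrt(2 ln t / N_a).
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1 if random.random() < means[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
    return counts


print(ucb1([0.2, 0.5, 0.7], horizon=5000))  # most pulls go to the best arm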

6. KL-UCB (Kullback-Leibler UCB)


KL-UCB is an extension of the UCB algorithm that uses the Kullback-Leibler (KL) divergence
to refine the exploration-exploitation trade-off. The KL divergence measures the difference
between two probability distributions, providing a more nuanced approach to confidence
intervals.
The KL-UCB algorithm computes an upper confidence index for each arm and plays the arm
with the highest index:
Uₐ(t) = max{ q : Nₐ * D(μ̂ₐ || q) ≤ ln(t) + c },   Aₜ = argmaxₐ Uₐ(t)
where D(p || q) is the KL divergence between a reward distribution with mean p (the
empirical estimate μ̂ₐ) and one with mean q (the candidate value), and c is a small
non-negative constant controlling the amount of exploration.

KL-UCB often outperforms the standard UCB algorithm, especially for bounded rewards such
as Bernoulli rewards, where the Gaussian-style exploration bonus of UCB1 is loose. By
leveraging the KL divergence, it achieves tighter confidence intervals and more efficient
exploration.
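
The sketch below computes the KL-UCB index for Bernoulli rewards; the bisection search and the choice c = 0 are common implementation conventions rather than requirements of the algorithm.

import math


def bernoulli_kl(p, q):
    """KL divergence D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean_hat, n_pulls, t, c=0.0, iters=30):
    """Largest q with n_pulls * D(mean_hat || q) <= ln(t) + c, by bisection."""
    target = (math.log(t) + c) / n_pulls
    lo, hi = mean_hat, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean_hat, mid) <= target:
            lo = mid      # mid still satisfies the constraint; push upward
        else:
            hi = mid
    return lo


print(kl_ucb_index(mean_hat=0.5, n_pulls=20, t=100))  # upper confidence value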

7. Thompson Sampling
Thompson Sampling is a Bayesian approach to the multi-armed bandit problem that
balances exploration and exploitation through probabilistic inference. It maintains a
probability distribution over the expected rewards of each arm and selects arms based on
samples from these distributions.

The steps involved in Thompson Sampling are as follows:


1. Initialize a prior distribution for each arm's reward probability (commonly a Beta
distribution for Bernoulli rewards).
2. At each time step, sample a reward probability from the posterior distribution for each
arm.
3. Select the arm with the highest sampled reward probability.
4. Update the posterior distribution for the selected arm based on the observed reward.
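
A minimal Beta-Bernoulli sketch of these four steps; the uniform Beta(1, 1) prior and the arm means are illustrative assumptions.

import random


def thompson_sampling(means, horizon):
    k = len(means)
    alpha = [1.0] * k   # Beta(1, 1) uniform prior on each arm's success rate
    beta = [1.0] * k

    for _ in range(horizon):
        # Step 2: sample one plausible success probability per arm.
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(k)]
        # Step 3: play the arm whose sample is largest.
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if random.random() < means[arm] else 0
        # Step 4: Bayesian update of the selected arm's posterior.
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha, beta


print(thompson_sampling([0.2, 0.5, 0.7], horizon=5000))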

Thompson Sampling has been shown to perform well in practice, often matching or
exceeding the performance of more complex algorithms. Its probabilistic formulation also
makes it straightforward to extend to changing environments and non-stationary reward
distributions.
