Expanded Multi-Armed Bandit and Probability Basics
1. Basics of Probability and Linear Algebra
1. Random Variables:
- A random variable assigns numerical values to the outcomes of a random process. There
are two main types of random variables:
• Discrete Random Variables: These take on a countable number of distinct values. For
example, the outcome of rolling a die (1, 2, 3, 4, 5, 6) is a discrete random variable.
• Continuous Random Variables: These take on values from a continuous range, so there are
uncountably many possible values. An example is the height of individuals in a population.
2. Probability Distributions:
- Discrete Probability Distributions are described using a Probability Mass Function
(PMF), which assigns a probability to each possible outcome. Common discrete
distributions include:
• Binomial Distribution: Models the number of successes in a fixed number of
independent Bernoulli trials.
• Poisson Distribution: Models the number of events occurring within a fixed interval of
time or space.
- Continuous Probability Distributions are described using a Probability Density Function
(PDF). Unlike PMFs, PDFs do not give probabilities directly but rather describe the relative
likelihood of outcomes within an interval. Common continuous distributions include:
• Normal Distribution: Characterized by its bell-shaped curve, often used in natural and
social sciences.
• Exponential Distribution: Models the time between events in a Poisson process.
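As a quick illustration, the following Python sketch (using NumPy, with arbitrarily chosen parameters) draws samples from each of these distributions and checks the empirical means against the theoretical ones.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of samples per distribution

# Binomial: number of successes in 10 Bernoulli trials with p = 0.3
binom = rng.binomial(n=10, p=0.3, size=n)
print("Binomial mean:", binom.mean(), "(theory: 3.0)")

# Poisson: event counts with rate lambda = 4 per interval
poisson = rng.poisson(lam=4.0, size=n)
print("Poisson mean:", poisson.mean(), "(theory: 4.0)")

# Normal: bell-shaped curve with mean 0 and standard deviation 1
normal = rng.normal(loc=0.0, scale=1.0, size=n)
print("Normal mean:", normal.mean(), "(theory: 0.0)")

# Exponential: waiting time between events of a Poisson process with rate 4
expo = rng.exponential(scale=1 / 4.0, size=n)
print("Exponential mean:", expo.mean(), "(theory: 0.25)")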
4. Bayes' Theorem:
- Bayes' theorem provides a way to update the probability of a hypothesis based on new
evidence. It is expressed as:
P(A|B) = (P(B|A) * P(A)) / P(B)
Here, P(A|B) is the posterior probability of A given B, P(B|A) is the likelihood, P(A) is the
prior probability, and P(B) is the marginal likelihood. Bayes' theorem is foundational in
many areas of machine learning and statistical inference.
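To make the formula concrete, here is a small Python sketch of a classic diagnostic-test calculation; the prior, likelihood, and false-positive rate are made-up numbers used purely for illustration.

# Bayes' theorem with illustrative (made-up) numbers.
# A = "has the condition", B = "test comes back positive"
p_A = 0.01             # prior P(A): 1% of the population has the condition
p_B_given_A = 0.95     # likelihood P(B|A): the test detects the condition 95% of the time
p_B_given_notA = 0.05  # false-positive rate P(B|not A)

# Marginal likelihood P(B) via the law of total probability
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Posterior P(A|B) from Bayes' theorem
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(A|B) = {p_A_given_B:.3f}")  # roughly 0.161 with these numbers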
2. Matrix Operations:
- Addition: Two matrices of the same dimensions can be added together by adding their
corresponding elements.
- Multiplication: Vectors can be combined using the dot product, while matrices are
multiplied using matrix multiplication, in which each entry of the product is the dot product
of a row of the first matrix with a column of the second.
- Inverse & Transpose: The inverse of a square matrix A, denoted A⁻¹, satisfies AA⁻¹ = A⁻¹A
= I, where I is the identity matrix (not every square matrix has an inverse). The transpose of
a matrix swaps its rows with its columns.
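The following Python sketch (using NumPy, with arbitrary example matrices) runs through each of these operations.

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Addition: element-wise sum of two matrices with the same dimensions
print(A + B)

# Dot product of two vectors
v = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])
print(v @ w)  # 1*3 + 2*4 = 11

# Matrix multiplication
print(A @ B)

# Inverse and transpose: A @ inv(A) recovers the identity matrix
print(A @ np.linalg.inv(A))
print(A.T)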
2. The Multi-Armed Bandit Problem
Consider a gambler in front of several slot machines (each representing an arm). Each
machine has a different probability of paying out a reward, but these probabilities are
unknown to the gambler. The challenge is to decide which machine to play at each time step
so as to maximize the total winnings.
Formally, at each time step t, the agent selects an arm a from a set of K arms and receives a
reward drawn from the corresponding distribution. Over time, the agent must learn which
arms are more rewarding while still exploring enough to ensure that no potentially better
arms are overlooked.
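A minimal simulation of this setting, assuming Bernoulli (0/1) rewards and hand-picked payout probabilities that the agent is not allowed to inspect, might look like the following Python sketch.

import numpy as np

rng = np.random.default_rng(0)

# Hidden payout probabilities of K = 3 slot machines (unknown to the agent)
true_means = np.array([0.2, 0.5, 0.7])
K = len(true_means)

def pull(arm):
    """Play one machine and return a Bernoulli reward (1 with that machine's payout probability)."""
    return float(rng.random() < true_means[arm])

# A naive agent that picks a random arm at every time step
T = 1000
total_reward = sum(pull(rng.integers(K)) for _ in range(T))
print("Total reward of random play over", T, "steps:", total_reward)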
3. Definition of Regret
Regret is a key concept in the analysis of multi-armed bandit algorithms. It quantifies the
difference between the reward obtained by the algorithm and the reward that would have
been obtained by always selecting the best possible arm. If μ* denotes the highest expected
reward among the arms and aₜ the arm chosen at time step t, the cumulative regret after T
steps is:
Regret(T) = T * μ* − Σ μ(aₜ), with the sum running over t = 1, ..., T
The goal of a good bandit algorithm is to minimize regret over time. Ideally, the regret
should grow sublinearly with T, meaning that the average regret per time step decreases
as the algorithm learns more about the arms.
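Given a record of which arm was played at each step, the cumulative regret can be computed directly from this definition; the Python sketch below assumes hypothetical Bernoulli arms and a short, made-up sequence of choices.

import numpy as np

true_means = np.array([0.2, 0.5, 0.7])  # hypothetical expected rewards per arm
best_mean = true_means.max()            # mu*

# Suppose the algorithm played these arms over T = 6 steps
chosen_arms = [0, 1, 2, 2, 1, 2]

# Regret(T) = T * mu* - sum over t of mu(a_t)
regret = len(chosen_arms) * best_mean - true_means[chosen_arms].sum()
print("Cumulative expected regret:", regret)  # 6 * 0.7 - 3.3 = 0.9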
4. Achieving Sublinear Regret
Sublinear regret is a desirable property in multi-armed bandit algorithms, indicating that
the algorithm's performance approaches that of the optimal strategy over time. Achieving
sublinear regret requires a careful balance between exploration and exploitation.
In simple terms:
- Linear Regret (O(T)): If an algorithm chooses arms randomly without learning from past
experiences, the regret will grow linearly with time.
- Sublinear Regret (O(log T) or O(√T)): Efficient algorithms that learn from past experiences
can achieve sublinear regret, meaning the average regret per time step decreases over time.
Algorithms like UCB, KL-UCB, and Thompson Sampling are designed to achieve sublinear
regret by dynamically adjusting the balance between exploring new arms and exploiting
known high-reward arms.
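The difference is easiest to see in the average regret per step, Regret(T)/T: under linear growth it stays constant, while under O(log T) or O(√T) growth it shrinks toward zero. The constants in this Python sketch are arbitrary and only illustrate the trend.

import numpy as np

# Average regret per step (Regret(T) / T) for different growth rates
for T in (10**2, 10**4, 10**6):
    linear = 0.5 * T              # O(T) growth, e.g., random play with a constant per-step gap
    logarithmic = 10 * np.log(T)  # O(log T) growth, e.g., UCB-style algorithms
    square_root = 2 * np.sqrt(T)  # O(sqrt(T)) growth
    print(f"T={T:>8}: linear {linear / T:.3f}, log {logarithmic / T:.5f}, sqrt {square_root / T:.5f}")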
5. Upper Confidence Bound (UCB)
The UCB1 algorithm selects, at each time step t, the arm a that maximizes Q̂ₐ + √(2 * ln(t) / Nₐ),
where Q̂ₐ is the empirical mean reward of arm a and Nₐ is the number of times arm a has been
played so far. The term √(2 * ln(t) / Nₐ) represents the exploration bonus, which decreases
as an arm is selected more frequently. This encourages the algorithm to explore less
frequently chosen arms while gradually focusing on the most rewarding ones.
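A compact Python sketch of this rule (UCB1 on hypothetical Bernoulli arms, with each arm played once to initialize its estimate) could look like the following; the payout probabilities are assumptions used only to drive the simulation.

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])  # hidden, used only to simulate rewards
K, T = len(true_means), 10_000

counts = np.zeros(K)  # N_a: how often each arm has been played
sums = np.zeros(K)    # running sum of rewards per arm

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1  # play every arm once to initialize the estimates
    else:
        means = sums / counts                     # empirical mean Q_a
        bonus = np.sqrt(2 * np.log(t) / counts)   # exploration bonus
        arm = int(np.argmax(means + bonus))
    reward = float(rng.random() < true_means[arm])  # Bernoulli reward
    counts[arm] += 1
    sums[arm] += reward

print("Pulls per arm:", counts)
print("Cumulative expected regret:", T * true_means.max() - (true_means * counts).sum())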
6. KL-UCB
KL-UCB is a variant of UCB that derives each arm's upper confidence bound from the
Kullback-Leibler (KL) divergence between the arm's empirical reward distribution and
candidate means, rather than from a fixed square-root bonus. It often outperforms the
standard UCB algorithm, especially when the reward distributions are not Gaussian; the
KL-based bound yields tighter confidence intervals and more efficient exploration.
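For Bernoulli rewards, the KL-UCB index of an arm with empirical mean p̂ and Nₐ pulls is (in a simplified form, without the ln(ln t) correction term) the largest q ≥ p̂ such that Nₐ * KL(p̂, q) ≤ ln(t). The Python sketch below computes that index by binary search; the example numbers are arbitrary.

import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n_pulls, t):
    """Largest q in [p_hat, 1] with n_pulls * KL(p_hat, q) <= ln(t), found by binary search."""
    target = math.log(t)
    lo, hi = p_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if n_pulls * bernoulli_kl(p_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

# Example: an arm with empirical mean 0.4 after 20 pulls, at time step t = 100
print(kl_ucb_index(0.4, 20, 100))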
7. Thompson Sampling
Thompson Sampling is a Bayesian approach to the multi-armed bandit problem that
balances exploration and exploitation through probabilistic inference. It maintains a
probability distribution over the expected rewards of each arm and selects arms based on
samples from these distributions.
Thompson Sampling has been shown to perform well in practice, often matching or
exceeding the performance of more complex algorithms. Its probabilistic nature allows it to
adapt effectively to changing environments and non-stationary reward distributions.
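For Bernoulli rewards, a common instantiation keeps a Beta(successes + 1, failures + 1) posterior for each arm; the Python sketch below (hypothetical arm means, uniform Beta(1, 1) priors) shows the sample-then-act loop.

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])  # hidden, used only to simulate rewards
K, T = len(true_means), 10_000

successes = np.ones(K)  # Beta prior parameter alpha = 1 for every arm
failures = np.ones(K)   # Beta prior parameter beta = 1 for every arm

for _ in range(T):
    # Sample a plausible mean for each arm from its posterior, then act greedily on the samples
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))
    reward = float(rng.random() < true_means[arm])  # Bernoulli reward
    successes[arm] += reward
    failures[arm] += 1 - reward

posterior_means = successes / (successes + failures)
print("Posterior mean estimates:", np.round(posterior_means, 3))
print("Plays per arm:", (successes + failures - 2).astype(int))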