Bayes' Theorem Explained
Bayes' theorem is a fundamental concept in probability theory that provides a way to update
the probability of a hypothesis based on new evidence. It relates the prior probability,
likelihood, and posterior probability as follows:
Formula:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
Terms:
1. Prior Probability (P (A)):
The initial belief or probability of the hypothesis (A) before observing any new
evidence.
2. Likelihood (P(B ∣ A)):
The probability of observing the evidence (B) assuming the hypothesis (A) is true.
3. Posterior Probability (P(A ∣ B)):
The updated probability of the hypothesis (A) after taking into account the new evidence (B).
Explanation:
Bayes' theorem adjusts our prior belief (P (A)) by incorporating how likely the observed
evidence is under the hypothesis (P (B∣A)) and normalizing this by the overall probability of
the evidence (P (B)).
Example:
Suppose a medical test is 95% accurate, and only 1% of people have a particular disease.
Posterior Probability: Using Bayes' theorem, we calculate the revised probability that a
person has the disease given a positive test result.
This approach helps incorporate evidence (like the test result) to refine our belief about the
disease's likelihood.
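A minimal worked sketch of this calculation in Python, assuming "95% accurate" means both the true-positive rate (sensitivity) and the true-negative rate (specificity) are 0.95; these interpretations are assumptions, not stated in the example:

```python
# Posterior probability of disease given a positive test, via Bayes' theorem.
# Assumed figures: sensitivity = specificity = 0.95, prevalence = 0.01.
prior = 0.01                 # P(Disease)
sensitivity = 0.95           # P(Positive | Disease)
false_positive_rate = 0.05   # P(Positive | No Disease) = 1 - specificity

# Total probability of a positive result, P(Positive)
p_positive = sensitivity * prior + false_positive_rate * (1 - prior)

# Bayes' theorem: P(Disease | Positive)
posterior = sensitivity * prior / p_positive
print(f"P(Disease | Positive) = {posterior:.3f}")  # ~0.161
```

Even with a positive result, the posterior is only about 16%, because the disease is rare and false positives from the healthy majority dominate.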
Naïve Bayes Classifier
How It Works
The algorithm calculates the probability of each class given a set of input features and
selects the class with the highest probability.
$$P(C \mid X) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$$

Where:
C is a class label, X = (x1, x2, …, xn) is the feature vector, P(C) is the prior probability of the class, and P(xi ∣ C) is the probability of feature xi given class C.
Steps:
1. Training Phase:
Calculate probabilities: the prior P(C) for each class and the conditional probabilities P(xi ∣ C) for each feature, estimated from the training data.
2. Prediction Phase:
For a new email, compute P (Spam∣X) and P (Ham∣X), where X is the set of
words in the email.
Example:
Training probabilities P(Spam), P(Ham), and P(word ∣ class) are estimated from a labeled email corpus; a minimal sketch follows below.
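A minimal sketch of the training and prediction phases described above, using a hypothetical toy corpus (the word lists, counts, and the Laplace smoothing are illustrative assumptions, not values from the original example):

```python
import math
from collections import Counter

def train_naive_bayes(emails, labels):
    """Estimate P(C) and P(word | C) with Laplace smoothing from labeled emails."""
    classes = set(labels)
    priors, word_counts, totals = {}, {}, {}
    vocab = {w for email in emails for w in email}
    for c in classes:
        docs = [e for e, y in zip(emails, labels) if y == c]
        priors[c] = len(docs) / len(emails)
        counts = Counter(w for e in docs for w in e)
        word_counts[c] = counts
        totals[c] = sum(counts.values())
    return priors, word_counts, totals, vocab

def predict(email, priors, word_counts, totals, vocab):
    """Pick the class maximizing log P(C) + sum of log P(word | C)."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w in email:
            # Laplace smoothing avoids zero probabilities for unseen words.
            p = (word_counts[c][w] + 1) / (totals[c] + len(vocab))
            score += math.log(p)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: each email is a list of words.
emails = [["win", "money", "now"], ["meeting", "at", "noon"],
          ["win", "prize"], ["project", "meeting", "notes"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_naive_bayes(emails, labels)
print(predict(["win", "money"], *model))  # -> "spam"
```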
Advantages:
Simple and fast to implement.
Limitations:
Assumes feature independence, which may not always hold true.
Applications:
Sentiment analysis (e.g., classifying tweets as positive or negative).
Bayesian Belief Networks (BN)
A Bayesian Belief Network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies using a directed acyclic graph.
Representation of Knowledge in BN
1. Nodes: Represent random variables (discrete or continuous).
2. Edges: Directed links between nodes that represent direct probabilistic dependencies (from parent to child variables).
3. Conditional Probability Tables (CPTs): Each node contains a CPT that specifies the probability of the variable given its parents.
The process leverages the structure of the network to simplify computations by only
considering relevant dependencies.
Comparison with Naïve Bayes Classifier
| Aspect | Bayesian Belief Network | Naïve Bayes Classifier |
| --- | --- | --- |
| Graphical Structure | Directed Acyclic Graph (DAG). | Simple structure with one node for the class label and direct connections to features. |
| Inference | Performs reasoning across multiple variables and evidence. | Focused only on classifying data using Bayes' theorem. |
Challenges of BN:
2. Complexity: Designing the structure and defining CPTs can be difficult, especially for domains with many variables or unknown dependencies.
3. Data Requirements: Requires significant data to estimate probabilities accurately.
4. Inference Cost: Exact inference can be computationally infeasible for complex networks.
Practical Example of BN
Medical Diagnosis:
Variables: Symptoms (e.g., fever, cough), Diseases (e.g., flu, pneumonia), and Test
Results.
Edges: A disease may directly affect certain symptoms, and test results depend on the
disease.
Inference: Given observed symptoms and test results, a BN can calculate the probability
of various diseases and suggest the most likely cause.
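A minimal sketch of this kind of inference on a hypothetical two-node network (Flu → Fever); the probabilities below are illustrative assumptions, not clinical figures:

```python
# Tiny Bayesian network: Flu -> Fever, with hypothetical CPTs.
p_flu = 0.10                                  # P(Flu)
p_fever_given = {True: 0.80, False: 0.05}     # P(Fever | Flu), P(Fever | no Flu)

# Inference by enumeration: P(Flu | Fever) sums over both values of Flu.
joint_flu_fever = p_flu * p_fever_given[True]            # P(Flu, Fever)
joint_noflu_fever = (1 - p_flu) * p_fever_given[False]   # P(no Flu, Fever)
p_fever = joint_flu_fever + joint_noflu_fever            # marginal P(Fever)

print(f"P(Flu | Fever) = {joint_flu_fever / p_fever:.3f}")  # ~0.64
```

Larger networks follow the same pattern, summing the joint distribution over the unobserved variables; dedicated inference libraries are normally used once the network grows beyond a few nodes.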
Bayesian Belief Networks are powerful tools for reasoning under uncertainty and are used in
fields like medicine, engineering, and artificial intelligence. However, their complexity limits
their scalability in some applications compared to simpler models like Naïve Bayes.
Support Vector Machine (SVM)
A Support Vector Machine is a supervised learning algorithm used for classification and regression.
Key Concepts:
1. Hyperplane:
The decision boundary that separates the classes. In a 2D space it is a line; in a 3D space, the hyperplane is a plane.
SVM tries to find the optimal hyperplane that maximizes the margin between the classes.
2. Margin:
The margin is the distance between the hyperplane and the nearest data points
from either class.
SVM aims to maximize this margin for better generalization and robustness.
3. Support Vectors:
These are the data points closest to the hyperplane and are critical in defining the
position of the hyperplane.
4. Kernel Trick:
In cases where the data is not linearly separable in its original space, SVM uses
kernel functions to project the data into a higher-dimensional space where a
hyperplane can separate the classes.
Common kernels:
Linear: for (approximately) linearly separable data.
Polynomial: captures polynomial decision boundaries.
Radial Basis Function (RBF): a Gaussian kernel useful for non-linear problems.
Objective:
Minimize:

$$\frac{1}{2}\,\lVert w \rVert^{2}$$

Subject to:

$$y_i (w \cdot x_i + b) \ge 1 \quad \forall i$$

Where:
w is the weight vector (normal to the hyperplane), b is the bias term, xi are the training feature vectors, and yi ∈ {−1, +1} are their class labels.
The solution involves maximizing the margin while minimizing classification errors.
Soft Margin:
A soft margin allows some points to violate the margin constraints via slack variables, with a regularization parameter C controlling the trade-off between a wide margin and few classification errors.
Steps:
2. Choose a Kernel:
Select a kernel (e.g., linear, polynomial, or RBF) suited to the data's separability.
3. Train Model:
Fit the SVM to the training data using the selected kernel.
4. Tune Hyperparameters:
Adjust parameters such as the regularization constant C (and kernel parameters such as gamma for RBF), typically via cross-validation.
5. Make Predictions:
Classify new samples with the trained model.
Advantages of SVM
1. Effective in High Dimensions: Handles large feature spaces well.
Disadvantages of SVM
1. Computationally Intensive: Training can be slow for large datasets.
3. Memory Usage: Can be high for large datasets due to reliance on support vectors.
Practical Example
Handwriting Recognition:
Approach: Represent each digit image as a vector of pixel-intensity features and train an SVM (e.g., with an RBF kernel) to classify the digits.
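A minimal sketch of this approach with scikit-learn (the library, the dataset, and the hyperparameter grid are assumptions for illustration; the original does not prescribe a specific toolkit):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Digit images as flattened pixel-intensity feature vectors.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Choose a kernel and train: RBF-kernel SVM with feature scaling.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Tune hyperparameters (C and gamma) with cross-validation.
grid = GridSearchCV(model, {"svc__C": [1, 10, 100], "svc__gamma": ["scale", 0.01]}, cv=3)
grid.fit(X_train, y_train)

# Make predictions on unseen digits.
print("Best parameters:", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))
```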
Conclusion
SVM is a powerful algorithm for classification and regression, particularly for small- to
medium-sized datasets. Its ability to handle non-linear problems via kernels makes it
versatile, though its computational demands and complexity in parameter tuning can be
challenging.
Brute Force MAP Learning Algorithm
Bayesian learning evaluates each candidate hypothesis h in a hypothesis space H against the observed data D using Bayes' theorem:

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$
Key Concepts
Hypothesis (h): A candidate model or explanation for the data.
Prior (P (h)): The prior probability of the hypothesis, representing prior knowledge or
assumptions about h.
Likelihood (P (D∣h)): The probability of observing the data D given that the hypothesis
h is true.
Posterior (P (h∣D)): The updated probability of h after observing D .
The MAP (maximum a posteriori) hypothesis is the one with the highest posterior probability:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$$
Steps:
1. List Hypotheses:
Generate a finite or countable list of all possible hypotheses in the hypothesis space
(H ).
2. Calculate Posterior:
For each hypothesis h in H, compute P(h ∣ D) ∝ P(D ∣ h) P(h) using Bayes' theorem.
3. Maximize Posterior:
Identify the hypothesis hMAP that has the highest posterior probability.
Example
Coin Toss:
Suppose you want to determine the bias (h) of a coin (probability of heads) based on 10
observed tosses, where D = {H, H, T , H, T , H, H, T , H, T }.
1. Hypothesis Space:
A discrete set of candidate values for the coin's bias h (the probability of heads).
2. Prior (P(h)):
A prior probability assigned to each candidate h, e.g., uniform if no bias is favored in advance.
3. Likelihood (P(D ∣ h)):
Since D contains 6 heads and 4 tails, P(D ∣ h) = h^6 (1 − h)^4.
4. Posterior (P(h ∣ D)):
For each candidate, P(h ∣ D) ∝ P(D ∣ h) P(h).
Find the hypothesis with the highest posterior probability:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\, P(h)$$
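A minimal sketch of the brute-force search for this coin example, assuming a uniform prior over a small grid of candidate biases (the grid and the prior are illustrative assumptions):

```python
# Brute-force MAP for the coin example: D has 6 heads and 4 tails.
heads, tails = 6, 4

# 1. Hypothesis space: candidate biases (probability of heads), uniform prior.
hypotheses = [i / 10 for i in range(1, 10)]          # 0.1, 0.2, ..., 0.9
prior = 1 / len(hypotheses)

# 2. Posterior up to the constant P(D): P(h | D) proportional to P(D | h) * P(h)
def unnormalized_posterior(h):
    likelihood = h**heads * (1 - h)**tails           # Bernoulli likelihood
    return likelihood * prior

# 3. Maximize: pick the hypothesis with the highest posterior.
h_map = max(hypotheses, key=unnormalized_posterior)
print("h_MAP =", h_map)                              # 0.6
```

With a uniform prior, the MAP hypothesis coincides with the maximum-likelihood estimate, here 0.6 (6 heads out of 10 tosses).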
Conclusion
The brute force MAP learning algorithm systematically evaluates all hypotheses to find the
most probable one. While accurate for small hypothesis spaces, it is computationally
impractical for large or continuous spaces. Advanced methods like optimization and
sampling techniques are used to handle such cases efficiently.
Write short note on Hypothesis testing.
Hypothesis Testing
Hypothesis Testing is a statistical method used to make decisions or inferences about a
population based on sample data. It helps assess whether a specific assumption (called a
hypothesis) about a population parameter is supported by evidence.
Key Components:
1. Null Hypothesis (H0):
The default assumption about the population parameter (e.g., H0: μ = 50, the mean is 50).
2. Alternative Hypothesis (Ha):
The claim tested against H0.
Example: Ha: μ ≠ 50 (the mean is not 50).
3. Significance Level (α):
The threshold probability (commonly 0.05) of incorrectly rejecting H0 when it is true.
A lower α reduces the risk of false positives but increases the risk of false negatives.
4. Test Statistic:
A value calculated from the sample data that is used to decide whether to reject H0 .
5. P-Value:
The probability of obtaining results at least as extreme as those observed, assuming H0 is true.
6. Decision:
Reject H0: Evidence supports Ha.
Fail to reject H0: The evidence is insufficient to support Ha.
Steps in Hypothesis Testing:
1. State the null (H0) and alternative (Ha) hypotheses.
2. Choose the significance level (α).
3. Compute the test statistic from the sample data.
4. Determine the p-value or compare the test statistic with the critical value.
5. Draw a conclusion: reject or fail to reject H0.
Two-Tailed Test: The rejection region lies in both tails of the sampling distribution, used when Ha states "not equal" (as in the example above).
Example
A company claims the average life of its batteries is 500 hours (H0 : μ = 500). A sample of
30 batteries shows a mean life of 490 hours with a standard deviation of 20 hours. Using a
significance level of 0.05, hypothesis testing can determine if the claim is valid.
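A minimal sketch of this test as a one-sample t-test computed from the summary figures (whether a z- or t-test is intended is not stated, so the t distribution and the use of SciPy are assumptions):

```python
import math
from scipy import stats

# Summary data from the example.
mu_0, x_bar, s, n, alpha = 500, 490, 20, 30, 0.05

# Test statistic: t = (sample mean - claimed mean) / (s / sqrt(n))
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-tailed p-value with n - 1 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The statistic is about −2.74 with 29 degrees of freedom, giving a two-tailed p-value of roughly 0.01, so at α = 0.05 the company's claim would be rejected.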
Applications of Hypothesis Testing:
1. Quality Control: Assessing if a product meets quality standards.
Conclusion
Hypothesis testing is a cornerstone of inferential statistics, providing a systematic framework
to evaluate assumptions about population parameters based on sample data. It aids
decision-making under uncertainty, ensuring conclusions are statistically justified.
Q-Learning
Q-learning is a model-free reinforcement learning algorithm used to train agents to make
decisions in environments where the outcomes of actions are uncertain. It learns the optimal
action-selection policy by estimating the Q-value (quality) of state-action pairs without
requiring prior knowledge of the environment's dynamics.
Key Concepts
1. Q-Value (Q(s, a)):
The estimated quality of taking action a in state s, i.e., the expected cumulative discounted reward when acting optimally afterwards.
2. Policy:
A strategy that defines the action an agent should take in each state.
3. Bellman Equation:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:
Q(s, a) is the current estimate for state s and action a, α is the learning rate, r is the immediate reward, γ is the discount factor, and s′ is the next state (with a′ ranging over the actions available there).
4. Exploration vs. Exploitation:
Balances exploring new actions (to discover better rewards) with exploiting known actions to maximize reward.
Common strategies: ϵ-greedy, where the agent explores with probability ϵ and exploits with probability 1 − ϵ.
Steps in Q-Learning
1. Initialize the Q-table (Q(s, a)) with zeros.
2. In each state, select an action (e.g., ϵ-greedy), execute it, and observe the reward r and next state s′.
3. Update Q(s, a) using the update rule above.
4. Repeat over many episodes until the Q-values converge (a minimal sketch appears after the example below).
Example
Robot Navigation:
States: Each grid cell.
Rewards: for example, a positive reward for reaching the goal and a small penalty for each move.
The robot learns the optimal path to the goal by iteratively updating the Q-values.
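A minimal sketch of tabular Q-learning on a hypothetical one-dimensional corridor (five cells, goal at the right end); the reward values, α, γ, and ϵ are illustrative assumptions:

```python
import random

n_states, actions = 5, [-1, +1]        # corridor cells 0..4; move left or right
goal = n_states - 1
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate

# 1. Initialize the Q-table with zeros.
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for episode in range(500):
    s = 0
    while s != goal:
        # Epsilon-greedy selection: explore with probability epsilon, else exploit.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 10.0 if s_next == goal else -1.0   # goal reward, small step penalty
        # Q-learning update: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# The greedy policy should move right (+1) toward the goal in every state.
print([max(actions, key=lambda act: Q[(s, act)]) for s in range(goal)])
```

After training, the greedy policy moves right in every non-goal state, which is the shortest path to the goal.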
Advantages of Q-Learning
1. Model-Free: No need for a predefined model of the environment.
Disadvantages of Q-Learning
1. Scalability: Inefficient for environments with large state-action spaces (requires a large
Q-table).
3. Convergence Time: Can take a long time to converge to the optimal policy.
Applications of Q-Learning
Game AI: Training agents to play video games.
Autonomous Vehicles: Optimizing decision-making for pathfinding and traffic
management.
Conclusion
Q-learning is a powerful and simple reinforcement learning algorithm that enables agents to
learn optimal behaviors through trial and error. While it has limitations in scalability, it is
foundational in reinforcement learning and inspires more advanced algorithms like Deep Q-
Networks (DQN).