Module 2 Math Foundation II
CEN524: MATHEMATICAL
FOUNDATIONS FOR AI
BY
Omoruyi O., Olatimehin O. and Ajilore A.
Module 2
Outline II
• Introduction to Probability and Statistics
- Define probability and statistics, and their importance in AI
- Introduce Bayes' theorem and its significance
• Bayes' Theorem
- Explain the concept of Bayes' theorem and its application
- Provide examples and practice problems
• Distributions
- Introduce different types of distributions, such as Gaussian, Bernoulli, and Poisson
- Explain their importance in AI and provide examples
• Statistical Inference
- Introduce the concept of statistical inference and its application in AI
- Explain the importance of hypothesis testing and confidence intervals
Outline III
• Introduction to Calculus
- Define calculus and its importance in AI
- Introduce the concept of gradients and optimization
• Gradients
- Explain the concept of gradients and their application in optimization
- Provide examples and practice problems
• Optimization Basics
- Introduce the concept of optimization and its application in AI
- Explain the importance of convex optimization and local minima
• Applications of Calculus in AI
- Explain the application of calculus in AI, such as in deep learning and neural networks
- Provide examples and case studies
Defining Probability and Statistics, and Their Importance in AI
• Probability: The study of uncertainty, quantifying the likelihood of
events (e.g., a coin landing heads has a 50% chance). It provides a
mathematical framework to model randomness.
• Statistics: The science of collecting, analyzing, and interpreting data
to uncover patterns or make predictions.
• Importance in AI:
§ AI systems rely on probability to handle uncertainty (e.g., predicting
outcomes in self-driving cars).
§ Statistics enables data-driven decisions, model evaluation, and learning
from patterns (e.g., training machine learning models on datasets).
§ Together, they form the backbone of algorithms like neural networks,
decision trees, and reinforcement learning.
Bayes' Theorem
A fundamental rule in probability that updates beliefs based on
new evidence. Enables reasoning under uncertainty (e.g., spam
email detection).
• P(A|B) = [P(B|A) · P(A)] / P(B)
• ( P(A|B) ): Posterior probability (probability of A given B).
• ( P(B|A) ): Likelihood (probability of B given A).
• ( P(A) ): Prior probability (initial belief about A).
• ( P(B) ): Evidence (normalizing constant).
• Foundation for probabilistic models like Naive Bayes classifiers
and Bayesian networks.
Bayes' Theorem and Its Application
Bayes' theorem reverses conditional probabilities, allowing us to update probabilities as
new data arrives.
Example 1: If 1% of people have a disease (P(A) = 0.01), a test is 95% accurate (P(B|A) =
0.95), and has a 10% false positive rate (P(B|¬A) = 0.10), what's the probability of having the
disease given a positive test (P(A|B))?
Use Bayes:
P(A|B) = [P(B|A) · P(A)] / P(B) where
P(B) = P(B|A) · P(A) + P(B|¬A) · P(¬A)
Example 2: A disease affects 2% of people (P(D) = 0.02). A test is 95% accurate for
positives (P(T|D) = 0.95) and has a 10% false positive rate (P(T|¬D) = 0.10). If the test is
positive, what's P(D|T)?
P(T) = P(T|D) · P(D) + P(T|¬D) · P(¬D) = (0.95 · 0.02) + (0.10 · 0.98) = 0.019 + 0.098 =
0.117
P(D|T) = [P(T|D) · P(D)] / P(T) = 0.019 / 0.117 ≈ 0.162 (16.2%).
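A minimal Python sketch of Example 2; the function name bayes_posterior is illustrative, not from the slides.

# Bayes' theorem with the evidence expanded over D and ¬D
def bayes_posterior(prior, likelihood, false_positive_rate):
    evidence = likelihood * prior + false_positive_rate * (1 - prior)   # P(T)
    return likelihood * prior / evidence                                 # P(D|T)

# Example 2: P(D) = 0.02, P(T|D) = 0.95, P(T|¬D) = 0.10
print(bayes_posterior(0.02, 0.95, 0.10))   # ≈ 0.162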
Bayes Practice Problems
Problem 1: A robot predicts rain with 80% accuracy.
If it rains 30% of the time, what’s the probability it’s
raining given the robot predicts rain?
• P(Rain|Predict) = (P(Predict|Rain) · P(Rain)) /
P(Predict)
Problem 2: A model detects fraud with 90% accuracy.
Fraud occurs in 2% of transactions. If the model flags
a transaction, what’s the chance it’s fraud?
Different Types of Distributions
Gaussian (Normal) Distribution: Continuous, bell-shaped distribution
defined by mean (μ) and variance (σ²). Example: Test scores or sensor noise.
Bernoulli Distribution: Discrete, models a single trial with two outcomes
(success = 1 with probability p, failure = 0). Example: Whether a user clicks a
link.
Poisson Distribution: Discrete, models the number of events in a fixed interval,
parameterized by λ (average rate). Example: Number of emails received per
hour.
Importance in AI:
Gaussian: Assumed in many models (e.g., linear regression) and used for data
normalization.
Bernoulli: Core to binary classification tasks (e.g., fraud detection).
Poisson: Useful for modeling event frequencies (e.g., traffic prediction).
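A small NumPy sketch of sampling from the three distributions; the parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=1000)    # Gaussian: mean μ = 70, std σ = 10 (e.g., test scores)
clicks = rng.binomial(n=1, p=0.3, size=1000)        # Bernoulli: one trial, success probability p = 0.3 (e.g., link clicks)
emails = rng.poisson(lam=4, size=1000)              # Poisson: rate λ = 4 events per interval (e.g., emails per hour)
print(scores.mean(), clicks.mean(), emails.mean())  # sample means ≈ 70, 0.3, 4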
Probability Distribution
Probability Distribution in AI
• Gaussian in AI: Neural networks often assume input
features are normally distributed after standardization.
Example: Predicting house pr ices with normally
distributed errors.
• Bernoulli in AI: Used in logistic regression to predict
binary outcomes. Example: Classifying an image as “cat”
or “not cat.”
• Poisson in AI: Models rare events in time-series data.
Example: Predicting server failures based on historical
crash rates.
Statistical Inference and Its Application in AI
Statistical Inference is the process of using sample data to make
generalizations about a population. Includes estimating
parameters (e.g., mean accuracy of a model) and testing
hypotheses.
Application in AI:
• Parameter Estimation: Inferring weights in a machine
learning model from training data.
• Model Evaluation: Determining if an AI system's
performance is due to skill or chance. Example: Inferring
customer preferences from a sample of purchase data.
Hypothesis Testing
A method to test claims about data using statistical
evidence.
Process:
1. State the null hypothesis (H₀, e.g., "no difference in model
performance"),
2. compute a test statistic,
3. and compare to a threshold (p-value < α, typically 0.05).
• Example: Test if a new AI algorithm outperforms an old
one.
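A minimal sketch of steps 1–3 using a two-sample t-test from SciPy; the per-run accuracy numbers are made up for illustration.

from scipy import stats

old_runs = [0.81, 0.79, 0.80, 0.82, 0.78]   # old algorithm's accuracy over 5 evaluation runs
new_runs = [0.85, 0.84, 0.86, 0.83, 0.85]   # new algorithm's accuracy over 5 evaluation runs

# H₀: no difference in mean accuracy between the two algorithms
t_stat, p_value = stats.ttest_ind(new_runs, old_runs)
print(p_value < 0.05)   # True here -> reject H₀ at α = 0.05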
Confidence Interval
A range estimating a parameter with a confidence level
(e.g., 95% CI for accuracy: 88%–92%). Indicates
uncertainty in estimates.
• Example: Reporting an AI’s error rate with a range to
show reliability.
Importance in AI:
• Hypothesis testing validates improvements (e.g., “Is this
model significantly better?”).
• Confidence intervals quantify uncertainty, ensuring
trustworthy AI deployment.
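A short sketch of a 95% confidence interval for accuracy using the normal approximation; the evaluation counts are illustrative.

import math

correct, total = 900, 1000                               # illustrative test-set results
p_hat = correct / total                                  # observed accuracy
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / total)   # 95% normal-approximation margin
print(f"{p_hat - margin:.3f} to {p_hat + margin:.3f}")   # ≈ 0.881 to 0.919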
Conclusion on Probability and Statistics
• Probability and statistics are essential for AI to model
uncertainty, learn from data, and evaluate performance.
• Bayes' theorem enables adaptive reasoning, critical for
real-time AI applications.
• Distributions like Gaussian, Bernoulli, and Poisson
underpin data modeling in AI tasks.
• Statistical inference, hypothesis testing, and confidence
intervals ensure AI systems are robust and reliable.
Calculus
The mathematical study of change and accumulation, divided into:
• Differential Calculus: Focuses on rates of change (e.g., slopes, derivatives).
• Integral Calculus: Deals with accumulation (e.g., areas, sums over intervals).
Importance in AI:
• Enables optimization of models by finding minima or maxima (e.g.,
minimizing error in machine learning).
• Powers gradient-based methods, the backbone of training algorithms like
neural networks.
• Helps model continuous relationships in data, critical for tasks like
regression and deep learning.
Gradient
A vector of partial derivatives representing the
direction and rate of steepest increase of a function.
For a function f(x, y), the gradient is ∇f = (∂f/∂x,
∂f/∂y).
• Optimization: The process of finding the best
solution (e.g., minimum or maximum) of a function,
often called the objective or loss function in AI.
• Connection: Gradients guide optimization by
indicating how to adjust variables to reduce error
or improve performance.
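A brief numerical sketch of the gradient as a vector of partial derivatives, approximated by central finite differences; the function f(x, y) = x² + 3y is illustrative.

def f(x, y):
    return x**2 + 3*y

def gradient(f, x, y, h=1e-6):
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # ∂f/∂x
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)   # ∂f/∂y
    return (dfdx, dfdy)

print(gradient(f, 2.0, 1.0))   # ≈ (4.0, 3.0), matching ∇f = (2x, 3)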
Optimization
The gradient points uphill; its negative points downhill. In optimization,
we follow the negative gradient to minimize a function.
• Example: For f(x) = x², the derivative is f'(x) = 2x. At x = 2, the
gradient is 4, so moving opposite (downhill) reduces f(x).
• Application in Optimization:
Gradient Descent: Iteratively update parameters: x_new = x_old - η·∇f,
where η is the learning rate.
• Used to minimize loss functions in AI (e.g., mean squared error in
regression).
• Intuition: Think of gradient descent as a hiker descending a foggy
mountain by feeling the steepest slope underfoot.
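A minimal gradient descent sketch for f(x) = x² from this slide; the starting point and number of steps are illustrative.

def grad(x):
    return 2 * x               # f(x) = x², so f'(x) = 2x

x, eta = 2.0, 0.1              # start at x = 2 with learning rate η = 0.1
for step in range(20):
    x = x - eta * grad(x)      # x_new = x_old - η·∇f
print(x)                       # ≈ 0.023, approaching the minimum at x = 0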
Examples and Practice Problems
Example 1: Minimize f(x) = x² + 2x + 1.
• Derivative: f'(x) = 2x + 2.
• Set f'(x) = 0: 2x + 2 = 0, so x = -1 (minimum).
• Gradient descent: Start at x = 1, f'(1) = 4, step with η
= 0.1: x = 1 - 0.1•4 = 0.6.
Practice Problem: Use gradient descent to minimize
f(x) = 3x² - 6x + 5. Start at x = 0, η = 0.1, 2 steps.
Optimization and Its Application in AI
Finding the parameter values that minimize (or maximize) an
objective function. In AI, the objective is often a loss function
(e.g., difference between predicted and actual values).
Application in AI:
• Linear Regression: Minimize squared error to fit a line to data.
• Neural Networks: Adjust weights to minimize prediction error.
• Reinforcement Learning: Maximize cumulative reward.
• Example: Training a model to predict house prices by
minimizing the error between predicted and actual prices.
Convex Optimization
A function is convex if any line segment between two
points on its graph lies above or on the graph (e.g., f(x) =
x²).
Importance:
• Guarantees a single global minimum, making
optimization reliable and efficient.
• In AI: Convex loss functions (e.g., in logistic regression)
ensure gradient descent finds the best solution.
• Local Minima: Points where the function value is lower than at
nearby points but not necessarily the global minimum; non-convex
losses (e.g., in deep networks) can trap gradient descent there.
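A small numerical sketch of the chord definition of convexity; the test function and random points are illustrative.

import random

def f(x):
    return x**2                          # convex example from this slide

random.seed(0)
is_convex = True
for _ in range(1000):
    a, b, t = random.uniform(-10, 10), random.uniform(-10, 10), random.random()
    chord = t * f(a) + (1 - t) * f(b)    # point on the line segment between (a, f(a)) and (b, f(b))
    curve = f(t * a + (1 - t) * b)       # corresponding point on the graph
    is_convex = is_convex and curve <= chord + 1e-9
print(is_convex)   # True: every chord lies on or above the graph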
Calculus in AI
• Forward Pass: Compute predictions using a function of inputs and weights.
• Backward Pass (Backpropagation): Use gradients to update weights by
minimizing loss.
• Chain rule: ∂L/∂w = ∂L/∂y·∂y/∂w propagates errors backward.
Deep Learning:
Neural networks are compositions of functions (layers), and calculus optimizes
millions of parameters.
• Example: In a 3-layer network, gradients adjust weights to reduce
classification error.
Neural Networks:
• Loss function (e.g., cross-entropy) is differentiated w.r.t. each weight,
enabling learning.
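A tiny backpropagation sketch for a single weight with squared-error loss; the numbers are illustrative, and this is one neuron rather than a full network.

x, target = 2.0, 10.0          # one training example
w, eta = 1.0, 0.05             # initial weight and learning rate

for step in range(100):
    y = w * x                  # forward pass: prediction
    dL_dy = 2 * (y - target)   # ∂L/∂y for L = (y - target)²
    dy_dw = x                  # ∂y/∂w
    dL_dw = dL_dy * dy_dw      # chain rule: ∂L/∂w = ∂L/∂y · ∂y/∂w
    w = w - eta * dL_dw        # backward pass: gradient update

print(w * x)                   # ≈ 10.0: the prediction converges to the target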
Example:
Linear Regression:
• Loss: L = (1/n)∑(yᵢ - (wxᵢ + b))²
• Gradients: ∂L/∂w = -(2/n)∑xᵢ(yᵢ - (wxᵢ + b)),
∂L/∂b = -(2/n)∑(yᵢ - (wxᵢ + b))
• Optimization: Gradient descent fits the line.
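A short sketch of gradient descent with exactly these gradients; the four data points and learning rate are illustrative.

xs = [1.0, 2.0, 3.0, 4.0]          # data lying exactly on y = 2x + 1
ys = [3.0, 5.0, 7.0, 9.0]
n = len(xs)
w, b, eta = 0.0, 0.0, 0.01

for step in range(5000):
    dw = -(2 / n) * sum(x * (y - (w * x + b)) for x, y in zip(xs, ys))   # ∂L/∂w
    db = -(2 / n) * sum(y - (w * x + b) for x, y in zip(xs, ys))         # ∂L/∂b
    w, b = w - eta * dw, b - eta * db

print(round(w, 2), round(b, 2))    # ≈ 2.0 and 1.0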
Case Study
Image Classification with CNNs:
• Convolutional Neural Networks (CNNs) use calculus to
optimize filters and weights.
• Loss: Cross-entropy between predicted and true labels.
• Backpropagation adjusts millions of parameters to recognize
patterns (e.g., edges, shapes).
GPT Models:
• Transformer-based models (like ChatGPT) rely on gradient
descent to optimize attention weights, enabling language
understanding.