ML Unit-3
Probability is a way of measuring the likelihood that something will happen. It's a fundamental
concept in mathematics and statistics, and it has applications in many areas of life, from science and
engineering to finance and gambling. Here are some of the basic concepts of probability:
1. Experiment: An experiment is any process that produces an outcome. For example, flipping a coin,
rolling a die, or drawing a card from a deck are all experiments.
2. Sample Space: The sample space of an experiment is the set of all possible outcomes. For
example, the sample space for flipping a coin is {Heads, Tails}, and the sample space for rolling a die
is {1, 2, 3, 4, 5, 6}.
3. Event: An event is a subset of the sample space. For example, the event "rolling an even number"
on a die is the set {2, 4, 6}.
4. Probability: The probability of an event is a number between 0 and 1 that measures how likely the
event is to occur. A probability of 0 means that the event is impossible, and a probability of 1 means
that the event is certain.
5. Calculating Probability: If all outcomes in a sample space are equally likely, the probability of an
event E is calculated as:
P(E) = (number of favorable outcomes) / (total number of possible outcomes)
For example, the probability of rolling a 4 on a fair die is 1/6, because there is one favorable outcome
(rolling a 4) and six possible outcomes (1, 2, 3, 4, 5, 6).
Addition Rule: If two events A and B are mutually exclusive (they cannot both occur), the
probability of either A or B occurring is the sum of their individual probabilities: P(A or B) =
P(A) + P(B).
Multiplication Rule: If two events A and B are independent (the outcome of one does not
affect the outcome of the other), the probability of both A and B occurring is the product of
their individual probabilities: P(A and B) = P(A) * P(B).
Complement Rule: The complement of an event A is the event that A does not occur. The
probability of the complement of A is 1 minus the probability of A: P(not A) = 1 - P(A).
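These three rules can be checked with a short Python sketch (a fair six-sided die, all outcomes equally likely):

```python
# A small sketch of the addition, multiplication, and complement rules
# for a fair six-sided die, where all outcomes are equally likely.
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Classical probability: favorable outcomes / total outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}

# Addition rule: {1, 2} and {4, 6} are mutually exclusive (no shared outcomes).
assert prob({1, 2} | {4, 6}) == prob({1, 2}) + prob({4, 6})

# Complement rule: P(not even) = 1 - P(even).
assert prob(sample_space - even) == 1 - prob(even)

# Multiplication rule: two independent rolls both landing on 6.
p_two_sixes = prob({6}) * prob({6})
print(p_two_sixes)  # 1/36
```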
Statistical Tools in Machine Learning:
Descriptive Statistics: Tools like mean, median, standard deviation, and histograms help
summarize and visualize data, revealing patterns, anomalies, and distributions. This
understanding is crucial for data cleaning, transformation, and feature engineering.
Data Distributions: Understanding the distribution of data (e.g., normal, skewed) informs the
choice of appropriate algorithms and preprocessing techniques.
Handling Missing Data: Statistical methods like imputation (mean, median, or regression
imputation) help address missing values in datasets.
Outlier Detection: Statistical techniques help identify and handle outliers that can skew
model training.
Correlation Analysis: Statistical measures like Pearson's correlation coefficient help identify
relationships between variables, aiding in feature selection and dimensionality reduction.
ANOVA and Chi-Square Tests: These tests help in feature selection by assessing the
significance of categorical variables.
Regression Analysis: Statistical regression techniques (linear, logistic, etc.) are used for
predictive modeling.
Bayesian Statistics: Provides a framework for updating beliefs about model parameters as
more data becomes available.
Statistical Hypothesis Testing: Used to compare the performance of different models and
determine if the observed differences are statistically significant.
Performance Metrics: Many evaluation metrics are based on statistical concepts (e.g.,
precision, recall, F1-score, ROC curves).
Confidence Intervals: Provide a range of values within which the true model performance is
likely to fall.
In essence, statistical tools provide the mathematical foundation for machine learning: they enable
us to understand and prepare data, select features, build models, and evaluate them rigorously.
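As a small illustration, the descriptive statistics mentioned above can be computed with Python's standard library (the heights below are made-up sample values):

```python
# Descriptive statistics for a small sample, using only the standard library.
import statistics

heights_cm = [160, 162, 165, 167, 170, 171, 174, 180]  # hypothetical sample

mean = statistics.mean(heights_cm)
median = statistics.median(heights_cm)
stdev = statistics.stdev(heights_cm)  # sample standard deviation (n - 1 denominator)

print(f"mean={mean:.1f}, median={median:.1f}, stdev={stdev:.2f}")
```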
Concept of probability:
Probability is a way of quantifying the likelihood of an event occurring. It's a fundamental concept in
mathematics, statistics, and various fields like science, finance, and gambling. Here's a breakdown of
the core ideas:
1. Randomness and Uncertainty:
Probability deals with situations where outcomes are uncertain or random. This means that
even though we might know the possible outcomes, we can't predict with absolute certainty
which one will occur in a single trial.
Examples: Flipping a coin (heads or tails), rolling a die (1 to 6), drawing a card from a deck.
2. Basic Terms:
Sample Space: The set of all possible outcomes of an experiment (e.g., for a coin flip: {Heads,
Tails}).
Event: A specific outcome or set of outcomes within the sample space (e.g., "getting heads"
is an event).
3. Measuring Likelihood:
Probability is expressed as a number from 0 (the event is impossible) to 1 (the event is
certain). Values between 0 and 1 represent varying degrees of likelihood. For instance, 0.5
means the event is equally likely to happen or not happen.
4. Calculating Probability:
Classical Probability: When all outcomes in the sample space are equally likely, the
probability of an event E is calculated as:
P(E) = (number of favorable outcomes) / (total number of possible outcomes)
Example: The probability of rolling a 3 on a fair six-sided die is 1/6 because there's one "3" and six
possible outcomes (1, 2, 3, 4, 5, 6).
Empirical Probability: When outcomes are not equally likely, or when we have data from
repeated trials, we can estimate probability based on observed frequencies:
P(E) ≈ (number of times E occurred) / (total number of trials)
Example: If you flip a coin 100 times and get heads 55 times, the empirical probability of getting
heads is 55/100 = 0.55.
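The empirical approach can be demonstrated by simulation; the sketch below flips a simulated fair coin 10,000 times (the seed is fixed only for reproducibility):

```python
# Estimating an empirical probability by simulation: flip a fair coin many
# times and count the fraction of heads. The estimate approaches 0.5 as the
# number of trials grows.
import random

random.seed(42)  # fixed seed so the run is reproducible

trials = 10_000
heads = sum(random.random() < 0.5 for _ in range(trials))
empirical_p = heads / trials

print(f"empirical P(heads) ≈ {empirical_p}")
assert abs(empirical_p - 0.5) < 0.05  # close to the true probability
```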
5. Key Concepts and Rules:
Complementary Events: The complement of an event A is the event that A does not occur.
The probability of the complement is 1 - P(A).
Mutually Exclusive Events: Events that cannot both occur at the same time. If A and B are
mutually exclusive, then P(A or B) = P(A) + P(B).
Independent Events: Events where the occurrence of one does not affect the probability of
the other. If A and B are independent, then P(A and B) = P(A) * P(B).
Conditional Probability: The probability of an event A occurring given that another event B
has already occurred, denoted as P(A|B).
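Conditional probability can be made concrete by counting outcomes; the sketch below computes P(sum = 8 | first die = 3) for two fair dice:

```python
# Conditional probability by counting: for two fair dice,
# P(sum = 8 | first die shows 3) = P(A and B) / P(B).
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 equally likely rolls

B = [o for o in outcomes if o[0] == 3]           # first die shows 3
A_and_B = [o for o in B if sum(o) == 8]          # ...and the sum is 8

p_given = len(A_and_B) / len(B)                  # P(A|B)
print(p_given)  # 1/6 ≈ 0.1667 (only (3, 5) qualifies out of 6 outcomes)
```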
Discrete Probability Distributions:
1. Bernoulli Distribution:
Has two possible outcomes: 1 (success) with probability p, and 0 (failure) with probability q =
1 - p.
2. Binomial Distribution:
Represents the probability of the number of successes in a fixed number n of independent
Bernoulli trials, each with the same probability of success p.
Example: The number of heads in 10 coin flips.
3. Poisson Distribution:
Represents the probability of a given number of events occurring in a fixed interval of time or
space if these events occur with a known average rate and independently of the time since
the last event.
Example: The number of phone calls received by a call center per hour.
4. Geometric Distribution:
Represents the probability of the number of trials needed to get the first success in a series
of independent Bernoulli trials, each with the same probability of success p.
Example: The number of coin flips needed to get the first head.
5. Negative Binomial Distribution:
Represents the probability of the number of trials needed to get r successes in a series of
independent Bernoulli trials, each with the same probability of success p.
6. Hypergeometric Distribution:
Represents the probability of a given number of successes in draws made without
replacement from a finite population containing a known number of successes.
Example: The number of red cards in a 5-card hand drawn from a standard deck.
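Assuming NumPy is available, samples from several of the discrete distributions above can be drawn as follows (the parameter values are illustrative):

```python
# Drawing samples from common discrete distributions with NumPy's Generator.
import numpy as np

rng = np.random.default_rng(seed=0)

bernoulli = rng.binomial(n=1, p=0.3, size=5)   # Bernoulli is binomial with n=1
binomial = rng.binomial(n=10, p=0.5, size=5)   # successes in 10 coin flips
poisson = rng.poisson(lam=4.0, size=5)         # e.g. calls per hour, mean 4
geometric = rng.geometric(p=0.5, size=5)       # flips until the first head

print(bernoulli, binomial, poisson, geometric)
```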
Continuous Probability Distributions:
1. Uniform Distribution:
The probability density function (PDF) is constant over the interval and zero elsewhere.
Example: A random number generator that produces numbers between 0 and 1 with equal
probability.
2. Normal (Gaussian) Distribution:
Symmetric, bell-shaped curve, characterized by its mean (μ) and standard deviation (σ).
3. Exponential Distribution:
Describes the time between events in a Poisson process (a process in which events occur
continuously and independently at a constant average rate).
Example: The time between customer arrivals at a store, the lifetime of a light bulb.
4. Gamma Distribution:
Often used to model waiting times or sums of exponentially distributed random variables.
5. Beta Distribution:
Defined on the interval [0, 1], often used to model probabilities or proportions (e.g., as a
prior over a coin's bias in Bayesian inference).
Probability Density Function (PDF): A function that describes the relative likelihood of the
variable taking on a given value. The area under the PDF curve over a given interval
represents the probability that the variable falls within that interval.
The probability of the variable taking on any single specific value is zero.
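The "area under the PDF" idea can be checked numerically; the sketch below integrates the standard normal PDF over [-1, 1] with a simple midpoint rule:

```python
# The area under a PDF gives probability: numerically integrate the standard
# normal PDF over [-1, 1], which should come out near 0.6827.
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """PDF of the normal distribution with mean mu and std dev sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Midpoint-rule integration over [-1, 1].
a, b, steps = -1.0, 1.0, 10_000
dx = (b - a) / steps
area = sum(normal_pdf(a + (i + 0.5) * dx) * dx for i in range(steps))

print(round(area, 4))  # the familiar "68%" within one standard deviation
```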
Sampling Distributions:
A sampling distribution is a probability distribution of a statistic obtained from a larger number of
samples drawn from a specific population. It describes the distribution of values that a statistic (like
the mean, variance, or proportion) can take across all possible samples of a fixed size from that
population.
Population: The entire group you're interested in studying (e.g., all adults in a country).
Sample: A smaller, representative subset of the population (e.g., 1000 adults from that
country).
Parameter: A numerical value that describes a characteristic of the population (e.g., the
average height of all adults in the country). Parameters are usually unknown.
Statistic: A numerical value that describes a characteristic of the sample (e.g., the average
height of the 1000 adults in the sample). Statistics are used to estimate population
parameters.
Imagine taking many different samples of the same size from the same population.
For each sample, you calculate a statistic (e.g., the sample mean).
The distribution of these statistics across all possible samples is the sampling distribution.
Example:
Suppose we want to know the average height of all women in a country. We can't measure every
woman, so we take many samples of 100 women and calculate the average height for each sample.
The distribution of these sample means is the sampling distribution of the mean.
Central Limit Theorem: One of the most important theorems in statistics. It states that:
o If you take sufficiently large samples (usually n ≥ 30) from any population, the
sampling distribution of the mean will be approximately normally distributed,
regardless of the shape of the original population distribution.
o The mean of the sampling distribution will be equal to the population mean (μ).
o The standard deviation of the sampling distribution (also called the standard error)
will be equal to the population standard deviation (σ) divided by the square root of
the sample size (n): σ/√n.
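The Central Limit Theorem can be illustrated by simulation: draw many samples from a skewed (exponential) population and check that the sample means cluster around μ with spread close to σ/√n (parameter values are illustrative):

```python
# Central Limit Theorem by simulation: sample means from a skewed
# (exponential) population become approximately normal, centered at the
# population mean μ with standard error σ/√n.
import math
import random

random.seed(1)

mu = 2.0          # mean of Exp(rate=0.5) is 1/rate = 2
sigma = 2.0       # for the exponential, the std dev equals the mean
n = 50            # sample size
num_samples = 5_000

sample_means = [
    sum(random.expovariate(0.5) for _ in range(n)) / n
    for _ in range(num_samples)
]

observed_mean = sum(sample_means) / num_samples
# Population-style spread of the sample means (the observed standard error).
observed_se = math.sqrt(
    sum((m - observed_mean) ** 2 for m in sample_means) / num_samples
)

print(f"mean of sample means ≈ {observed_mean:.2f} (expected {mu})")
print(f"standard error ≈ {observed_se:.3f} (expected {sigma / math.sqrt(n):.3f})")
```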
The shape, center, and spread of the sampling distribution depend on the shape of the
population distribution, the statistic being computed, and the sample size.
Hypothesis Testing:
1. Hypotheses:
Null Hypothesis (H₀): A statement about the population parameter that we assume to be
true initially. It often represents the "status quo" or no effect.
Alternative Hypothesis (H₁ or Hₐ): A statement that contradicts the null hypothesis. It
represents what we're trying to find evidence for.
Example: H₀: a new drug has no effect; H₁: the new drug is effective.
2. Test Statistic:
A value calculated from the sample data that is used to evaluate the evidence against the
null hypothesis.
The choice of test statistic depends on the type of data and the hypothesis being tested.
3. Significance Level (α):
A pre-determined threshold (usually 0.05 or 5%) that represents the probability of rejecting
the null hypothesis when it is actually true (Type I error).
4. P-value:
The probability of observing a test statistic as extreme as, or more extreme than, the one
calculated from the sample data, assuming the null hypothesis is true.
5. Decision Rule:
P-value approach: If the p-value is less than or equal to the significance level (α), we reject
the null hypothesis. Otherwise, we fail to reject the null hypothesis.
Critical value approach: Compare the test statistic to a critical value from the appropriate
distribution. If the test statistic falls in the rejection region (beyond the critical value), we
reject the null hypothesis.
Steps in Hypothesis Testing:
1. State the hypotheses: Formulate the null hypothesis (H₀) and the alternative hypothesis (H₁).
2. Choose the significance level (α): Determine the acceptable probability of a Type I error.
3. Select the test statistic: Choose the appropriate statistic based on the data and hypotheses.
4. Collect sample data and calculate the test statistic: Obtain data and compute the value of
the test statistic.
5. Determine the p-value or critical value: Calculate the p-value or find the critical value from
the appropriate distribution.
6. Make a decision: Compare the p-value to α or the test statistic to the critical value and
decide whether to reject or fail to reject the null hypothesis.
Types of Errors:
Type I error (False Positive): Rejecting the null hypothesis when it is actually true. The
probability of a Type I error is α.
Type II error (False Negative): Failing to reject the null hypothesis when it is actually false.
The probability of a Type II error is denoted by β.
Example:
Suppose we want to test whether a new drug is effective (H₀: the drug has no effect; H₁: the drug is
effective). We conduct a clinical trial, collect data, and calculate a test statistic. If the p-value is less
than 0.05, we reject the null hypothesis and conclude that the drug is effective.
Hypothesis testing is a crucial tool in scientific research, business decision-making, and many other
fields. It provides a rigorous framework for evaluating evidence and drawing statistically sound
conclusions.
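The steps above can be sketched as a one-sample z-test (illustrative data, with the population standard deviation assumed known):

```python
# A one-sample z-test following the steps above: test H0: μ = 100 against
# H1: μ ≠ 100 at α = 0.05. The sample values and the assumed population σ
# are hypothetical, chosen for illustration.
import math

sample = [104, 98, 110, 105, 102, 99, 107, 103, 101, 106]
mu_0 = 100.0   # value claimed by the null hypothesis
sigma = 5.0    # assumed known population standard deviation
alpha = 0.05   # significance level

n = len(sample)
x_bar = sum(sample) / n
z = (x_bar - mu_0) / (sigma / math.sqrt(n))  # test statistic

# Two-sided p-value from the standard normal CDF.
cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
p_value = 2 * (1 - cdf)

print(f"z = {z:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```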
PRIOR:-
In machine learning, the prior refers to the probability distribution over the model's parameters
before observing the training data. It encodes our initial beliefs about these parameters. Priors can
serve several purposes:
o Regularize the model: Priors can prevent overfitting by encouraging the model to
favor simpler solutions.
o Improve performance with limited data: When you have a small dataset, priors can
help guide the model towards more reasonable solutions.
o Incorporate expert knowledge: You can encode domain-specific knowledge into the
priors, which can lead to better model performance.
Types of Priors:
o Informative Priors: These reflect strong prior beliefs about the parameters. They can
be based on previous experiments, expert knowledge, or theoretical considerations.
o Uninformative Priors: These represent weak prior beliefs. They're often used when
there's little or no prior knowledge about the parameters. A common example is the
uniform prior, which assigns equal probability to all possible values of the parameter.
Example:
Let's say you're building a model to predict house prices. You might have a prior belief that the
relationship between house size and price is likely to be linear. You could incorporate this prior by
using a linear regression model with a prior that favors linear relationships.
Key Points:
Priors are crucial in Bayesian machine learning: they formalize the assumptions made before
seeing any data and are combined with the likelihood, via Bayes' theorem, to produce the
posterior.
POSTERIOR:-
In machine learning, the posterior refers to the probability distribution over the model's
parameters after observing the training data. It represents the updated beliefs about these
parameters, taking into account both the prior beliefs (encoded in the prior distribution) and the
information gleaned from the data.
Key Points:
Relationship to Prior: The posterior is calculated using Bayes' theorem, which combines the
prior distribution with the likelihood function (the probability of observing the data given the
model parameters).
Practical Applications:
o Parameter Estimation: The posterior distribution can be used to estimate the most
likely values of the model parameters (e.g., using the maximum a posteriori (MAP)
estimate).
Example:
Consider a spam classification model. The prior might encode a belief that most emails are not spam.
After observing a large number of emails and their labels, the posterior distribution would reflect the
updated belief about the probability of an email being spam, taking into account the observed data.
In essence, the posterior represents the refined understanding of the model parameters gained
through the interplay of prior knowledge and observed data. It's a crucial component of Bayesian
machine learning, enabling more informed and reliable decision-making.
Likelihood:
In the context of Bayesian machine learning and Bayes' theorem, the likelihood refers to the
probability of observing the training data given specific values for the model's parameters.
Key Points:
Definition: The likelihood function tells us how likely it is to observe the actual data we have,
assuming that a particular set of model parameters is true.
Role in Bayes' Theorem: The likelihood is a crucial component of Bayes' theorem, where it's
combined with the prior probability to calculate the posterior probability.
Calculation: The likelihood is typically calculated using the probability distribution associated
with the model (e.g., Gaussian distribution for linear regression).
Example:
Imagine you're building a model to predict house prices. The model might have parameters like slope
and intercept for a linear relationship between house size and price. The likelihood function would
tell us how likely it is to observe the actual house prices in the training data given specific values for
the slope and intercept parameters.
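The interplay of prior, likelihood, and posterior can be sketched on a discrete grid: estimating a coin's bias p after observing 7 heads in 10 flips, starting from a uniform (uninformative) prior:

```python
# Prior, likelihood, posterior on a grid: estimate a coin's bias p after
# observing 7 heads in 10 flips, starting from a uniform prior.
grid = [i / 100 for i in range(1, 100)]   # candidate values of p
prior = [1.0 for _ in grid]               # uninformative (uniform) prior

heads, flips = 7, 10
# Binomial likelihood kernel: P(data | p) up to a constant factor.
likelihood = [p**heads * (1 - p)**(flips - heads) for p in grid]

# Bayes' theorem: posterior ∝ prior × likelihood, then normalize.
unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

map_estimate = grid[posterior.index(max(posterior))]
print(f"MAP estimate of p ≈ {map_estimate}")  # peaks near 7/10 = 0.7
```

With a uniform prior the MAP estimate coincides with the maximum-likelihood estimate; an informative prior would pull the peak toward the prior's mass.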
1. Bayes Classifiers
Bayes Classifiers are probabilistic models used for classification tasks. They are based on Bayes'
Theorem, which calculates the probability of a class given the observed features. The key idea is to
find the posterior probability P(Y∣X) and assign the class with the highest probability.
Key Components:
Prior Probability P(Y): The probability of a class before observing any features.
Likelihood P(X∣Y): The probability of observing the features given the class.
Posterior Probability P(Y∣X): The probability of the class given the observed features.
Marginal Probability P(X): The probability of observing the features (acts as a normalizing
constant).
Bayes' Theorem:
P(Y∣X) = P(X∣Y) · P(Y) / P(X)
2. Bayes Optimal Classifier
The Bayes Optimal Classifier is the theoretical best classifier that minimizes the probability of
misclassification. It combines the predictions of all possible hypotheses (models) weighted by their
posterior probabilities.
Key Points:
It is optimal because no other classifier can achieve a lower error rate on average.
It is usually impractical, because it requires summing over every hypothesis in the
hypothesis space.
Formula:
P(v∣D) = Σ over hᵢ in H of P(v∣hᵢ) · P(hᵢ∣D)
Where:
v is a candidate class label, D is the training data, H is the hypothesis space, and hᵢ is an
individual hypothesis. The predicted class is the v with the highest P(v∣D).
3. Naïve Bayes Classifier
The Naïve Bayes Classifier is a simplified version of Bayes Classifiers that assumes conditional
independence between features given the class label. This assumption makes it computationally
efficient and easy to implement.
Key Assumption:
Given the class label Y, the features are conditionally independent:
P(X∣Y) = P(x1∣Y) · P(x2∣Y) · … · P(xn∣Y)
Steps:
1. Training:
o Estimate the prior probabilities P(Y) for each class.
o Estimate the likelihoods P(xi∣Y) for each feature given each class.
2. Prediction:
o For a new instance X=(x1,x2,…,xn), compute the posterior probability for each class:
P(Y∣X) ∝ P(Y) · P(x1∣Y) · P(x2∣Y) · … · P(xn∣Y)
o Assign the class with the highest posterior probability.
Variants:
Gaussian Naïve Bayes: Used for continuous data, assuming features are normally distributed
within each class.
Multinomial Naïve Bayes: Used for discrete count data (e.g., word counts in text
classification).
Bernoulli Naïve Bayes: Used for binary features (e.g., word presence or absence).
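The training and prediction steps can be sketched as a minimal multinomial Naïve Bayes for text (a from-scratch illustration with made-up training data, not a production implementation):

```python
# Minimal multinomial Naive Bayes for text: word-count features with
# Laplace (add-one) smoothing. Training data is made up for illustration.
import math
from collections import Counter, defaultdict

train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("project meeting schedule", "ham"),
]

# Training: estimate priors P(Y) and per-class word counts for P(word | Y).
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for text, _ in train for w in text.split()}

def predict(text):
    """Return the class with the highest posterior (in log space)."""
    scores = {}
    for label in class_counts:
        log_score = math.log(class_counts[label] / len(train))  # log P(Y)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            count = word_counts[label][word] + 1  # add-one smoothing
            log_score += math.log(count / (total_words + len(vocab)))
        scores[label] = log_score
    return max(scores, key=scores.get)

print(predict("free money"))    # → spam
print(predict("team meeting"))  # → ham
```

Log probabilities are used instead of raw products to avoid numerical underflow when many features are multiplied together.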
Naïve Bayes is widely used in various domains due to its simplicity, efficiency, and effectiveness.
Some common applications include:
1. Text Classification:
Spam Filtering: Classify emails as spam or not spam based on their content.
Sentiment Analysis: Determine the sentiment (positive, negative, neutral) of text data.
2. Medical Diagnosis:
Predict the likelihood of a disease based on patient symptoms and test results.
3. Recommendation Systems:
Predict whether a user will like an item based on their past behavior and preferences.
4. Fraud Detection:
Classify transactions as fraudulent or legitimate based on transaction features.
5. Weather Prediction:
Estimate the likelihood of weather conditions (e.g., rain or no rain) from observed data.
6. Image Classification:
Classify images into categories (e.g., animals, objects) based on pixel data.
7. Customer Segmentation:
Group customers into categories based on their attributes and purchasing behavior.
Summary
Bayes Classifiers use Bayes' Theorem to predict the class with the highest posterior
probability.
Bayes Optimal Classifier is the theoretical best classifier but is often impractical.
Naïve Bayes Classifier simplifies Bayes Classifiers by assuming feature independence, making
it efficient and widely applicable.
Applications of Naïve Bayes include text classification, medical diagnosis, recommendation
systems, fraud detection, and more.