BSC ML CH2
Unit 2
BSc (Data Science)
*Few figures/content have been prepared/referred from internet sources/books.
Unit 2
1) Prior and posterior probabilities
2) Naive Bayesian algorithm
3) Laplacian correction
4) Logistic Regression: The Logistic model
5) Estimating the regression coefficients
6) Making predictions
7) Multiple logistic regression
Posterior Probability
A posterior probability is the probability of assigning observations to groups given
the data. The posterior is a probability distribution representing your
uncertainty over θ after you have sampled data – denoted π(θ|X). It is a
conditional distribution because it conditions on the observed data.
Suppose now that a subject has tested positive for HIV. It is known that the prevalence of HIV in the general population (the prior probability of infection) is 1%, the specificity of the test is 95%, and the sensitivity of the test is 99%.
What is the probability that the subject is HIV-infected? In other words, what is the conditional probability that a subject is HIV-infected given that he/she has tested positive?
The following table summarizes the calculations. (For the sake of simplicity you may consider the fractions (probabilities) as proportions of the general population.)
                 Infected (0.01)          Not infected (0.99)       Total
Test positive    0.01 × 0.99 = 0.0099     0.99 × 0.05 = 0.0495      0.0594
Test negative    0.01 × 0.01 = 0.0001     0.99 × 0.95 = 0.9405      0.9406
• Thus, the average proportion of positive tests overall is 0.0594, and the proportion of actually
infected among them is 0.0099/0.0594 or 0.167 = 16.7%. So, the posterior (i.e. after the test has
been carried out and turns out to be positive) probability that the subject is really HIV-infected is
0.167.
• The difference between prior and posterior probabilities characterizes the information gained from the experiment or measurement. In this example the probability changed from 0.01 (prior) to 0.167 (posterior).
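A minimal Python sketch of this calculation; the prior, sensitivity and specificity are taken from the example above:

# Prior prevalence, test sensitivity and specificity from the HIV example
prior = 0.01
sensitivity = 0.99   # P(test positive | infected)
specificity = 0.95   # P(test negative | not infected)

# Total probability of a positive test
p_positive = prior * sensitivity + (1 - prior) * (1 - specificity)   # 0.0594

# Posterior probability of infection given a positive test
posterior = prior * sensitivity / p_positive                         # 0.0099 / 0.0594
print(round(p_positive, 4), round(posterior, 3))                     # 0.0594 0.167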
Bayes' theorem: P(h|D) = (P(D|h) * P(h)) / P(D)
Where,
P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the prior probability of the data (the evidence).
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
P(D|h): the probability of the data D given that hypothesis h is true. This is known as the likelihood.
• Basic assumptions
• Use all the attributes
• Attributes are assumed to be:
• equally important: all attributes have the same relevance to the classification task.
• statistically independent (given the class value): knowledge about the value of a particular attribute tells us nothing about the value of another attribute once the class is known (see the factorization below).
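Under these assumptions, the class-conditional probability of a full attribute vector factorizes into a product of per-attribute probabilities:
P(x1, x2, ..., xn | c) = P(x1 | c) * P(x2 | c) * ... * P(xn | c)
and Naïve Bayes predicts the class c that maximizes P(c) * P(x1 | c) * ... * P(xn | c).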
• Why is it called Naïve Bayes?
• The name Naïve Bayes is made up of the two words Naïve and Bayes, which can be described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem
Properties
1. The Naive Bayes Algorithm is one of the popular classification machine learning algorithms; it classifies data based on the computation of conditional probability values.
2. It implements Bayes' theorem for the computation and uses class labels represented as feature values or vectors of predictors for classification.
3. Naive Bayes Algorithm is a fast algorithm for classification problems.
4. This algorithm is a good fit for real-time prediction, multi-class prediction, recommendation
system, text classification, and sentiment analysis use cases.
5. Naive Bayes Algorithm can be built using Gaussian, Multinomial and Bernoulli distribution.
6. This algorithm is scalable and easy to implement for a large data set.
7. It helps to calculate the posterior probability P(c|x) using the prior probability of class P(c), the prior
probability of predictor P(x), and the probability of predictor given class, also called as
likelihood P(x|c).
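Written as a formula, this is Bayes' theorem applied to a class c and a predictor (attribute) vector x:
P(c|x) = (P(x|c) * P(c)) / P(x)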
Steps
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
• Solution: To solve this, first consider the dataset:
     Outlook    Play
0    Rainy      Yes
1    Sunny      Yes
2    Overcast   Yes
3    Overcast   Yes
4    Sunny      No
...
9    Sunny      No
10   Sunny      Yes
11   Rainy      No
12   Overcast   Yes
13   Overcast   Yes
• Likelihood table for the weather condition (counts of No/Yes for each Outlook value):
Applying Bayes' theorem:
P(Yes|Sunny) = (P(Sunny|Yes) * P(Yes)) / P(Sunny)
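As a rough illustration of the three steps, the sketch below builds the frequency table and applies Bayes' theorem with pandas. Only the rows listed above are used (rows 5-8 are not shown in the table), so the resulting numbers apply to this subset only:

import pandas as pd

# Subset of the weather dataset shown above (rows 5-8 omitted in the source)
data = pd.DataFrame({
    "Outlook": ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny",
                "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"],
    "Play":    ["Yes", "Yes", "Yes", "Yes", "No",
                "No", "Yes", "No", "Yes", "Yes"],
})

# Step 1: frequency table of Outlook vs Play
freq = pd.crosstab(data["Outlook"], data["Play"])

# Steps 2-3: likelihoods and Bayes' theorem for P(Yes | Sunny)
p_yes = (data["Play"] == "Yes").mean()                             # P(Yes)
p_sunny = (data["Outlook"] == "Sunny").mean()                      # P(Sunny)
p_sunny_given_yes = freq.loc["Sunny", "Yes"] / freq["Yes"].sum()   # P(Sunny | Yes)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny            # P(Yes | Sunny)
print(freq)
print(p_yes_given_sunny)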
• Gaussian: The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.
• Multinomial: The Multinomial Naïve Bayes classifier is used when the data are multinomially distributed. It is primarily used for document classification problems, i.e. deciding which category a particular document belongs to, such as Sports, Politics, Education, etc. The classifier uses the frequency of words as the predictors.
• Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also well known for document classification tasks.
Types of the Naive Bayes Model:
• Gaussian Naive Bayes: used for continuous data; the data should follow a Gaussian (normal) distribution.
• Multinomial Naive Bayes: used for discrete data; typical use cases are text classification and document categorization.
• Bernoulli Naive Bayes: used for binary data; typical use cases are spam detection and sentiment analysis.
• Note: a classifier that goes through all the possibilities is very slow and time-consuming.
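A minimal scikit-learn sketch of the three variants; the toy feature matrices below are made up purely for illustration:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Made-up toy data: 4 samples, 3 features, binary labels
X_continuous = np.array([[5.1, 3.5, 1.4], [4.9, 3.0, 1.4], [6.3, 3.3, 6.0], [5.8, 2.7, 5.1]])
X_counts     = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4], [0, 0, 2]])   # e.g. word counts
X_binary     = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]])   # e.g. word present/absent
y = np.array([0, 0, 1, 1])

gnb = GaussianNB().fit(X_continuous, y)    # continuous features
mnb = MultinomialNB().fit(X_counts, y)     # discrete counts (text classification)
bnb = BernoulliNB().fit(X_binary, y)       # binary features (spam detection)
print(gnb.predict([[6.0, 3.0, 5.0]]))      # predict the class of a new sample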
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for Binary classification.
• It performs well in Multi-class predictions as compared to the other
Algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naive Bayes:
• Naive Bayes assumes that all predictors (or features) are independent, which rarely happens in real life. This limits the applicability of the algorithm in real-world use cases.
• This algorithm faces the ‘zero-frequency problem’ where it assigns
zero probability to a categorical variable whose category in the test
data set wasn’t available in the training dataset. It would be best if
you used a smoothing technique to overcome this issue.
• Its estimations can be wrong in some cases, so you shouldn’t take its
probability outputs very seriously.
Applications
of Naïve Bayes Classifier:
Example 1: Probabilities of weather data
• Answer : Play=No
Example
• Using this data, we have to identify the species of an entity with the following attributes:
• X = {Color=Green, Legs=2, Height=Tall, Smelly=No}
• To predict the class label for the above attribute set, we will first
calculate the probability of the species being M or H in total.
• P(Species=M)=4/8=0.5
• P(Species=H)=4/8=0.5
• Next, we will calculate the conditional probability of each attribute
value for each class label.
• P(Color=White/Species=M) = 2/4 = 0.5      P(Color=White/Species=H) = 3/4 = 0.75
• P(Color=Green/Species=M) = 2/4 = 0.5      P(Color=Green/Species=H) = 1/4 = 0.25
• P(Legs=2/Species=M) = 1/4 = 0.25          P(Legs=2/Species=H) = 4/4 = 1
• P(Legs=3/Species=M) = 3/4 = 0.75          P(Legs=3/Species=H) = 0/4 = 0
• P(Height=Tall/Species=M) = 3/4 = 0.75     P(Height=Tall/Species=H) = 2/4 = 0.5
• P(Height=Short/Species=M) = 1/4 = 0.25    P(Height=Short/Species=H) = 2/4 = 0.5
• P(Smelly=Yes/Species=M) = 3/4 = 0.75      P(Smelly=Yes/Species=H) = 1/4 = 0.25
• P(Smelly=No/Species=M) = 1/4 = 0.25       P(Smelly=No/Species=H) = 3/4 = 0.75
• Now that we have calculated the conditional probabilities, we will use them to calculate the probability of the
new attribute set belonging to a single class.
• P(M/X)=P(Species=M)*P(Color=Green/Species=M)*P(Legs=2/Species=M)*P(Height=Tall/Species=M)*P(Smelly=
No/Species=M)
• =0.5*0.5*0.25*0.75*0.25
• =0.0117
• Similarly, the probability of X belonging to Species H will be calculated as
follows.
• P(H/X)=P(Species=H)*P(Color=Green/Species=H)*P(Legs=2/Species=H)*P(
Height=Tall/Species=H)*P(Smelly=No/Species=H)
• = 0.5*0.25*1*0.5*0.75
• = 0.0469
• So, the probability of X belonging to Species M is 0.0117 and to Species H is 0.0469. Hence, we assign the entity X with attributes {Color=Green, Legs=2, Height=Tall, Smelly=No} to Species H.
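A quick Python sketch that redoes this arithmetic, with the conditional probabilities copied from the values computed above:

# Prior probabilities of the two species
p_m, p_h = 0.5, 0.5

# P(Color=Green, Legs=2, Height=Tall, Smelly=No | species), from the tables above
p_x_given_m = 0.5 * 0.25 * 0.75 * 0.25
p_x_given_h = 0.25 * 1.0 * 0.5 * 0.75

score_m = p_m * p_x_given_m   # ≈ 0.0117
score_h = p_h * p_x_given_h   # ≈ 0.0469
print("Predicted species:", "M" if score_m > score_h else "H")   # H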
• P(A|B) = (P(B|A) * P(A)) / P(B)
• Mango:
• P(X | Mango) = P(Yellow | Mango) * P(Sweet | Mango) * P(Long | Mango)
• a) P(Yellow | Mango) = (P(Mango | Yellow) * P(Yellow)) / P(Mango)
• = ((350/800) * (800/1200)) / (650/1200)
• P(Yellow | Mango) = 0.53 → 1
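The same calculation in Python; the counts below (1200 fruits in total, 800 yellow, 650 mangoes, 350 that are both yellow and mango) are an assumption inferred from the fractions shown above:

# Counts assumed from the fractions in the slide
p_mango_given_yellow = 350 / 800   # P(Mango | Yellow)
p_yellow = 800 / 1200              # P(Yellow)
p_mango = 650 / 1200               # P(Mango)

p_yellow_given_mango = p_mango_given_yellow * p_yellow / p_mango
print(p_yellow_given_mango)        # ≈ 0.538 (shown as 0.53 above)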
So, in the case that we don't have a particular ingredient (feature value) in our training set, the posterior probability comes out to 1 / (N + k) instead of zero.
Using Laplace smoothing, we can represent P(x'|positive) as:
P(x'|positive) = (number of positive training examples containing x' + α) / (N + α*k)
where N is the number of positive training examples, k is the number of distinct values the feature can take, and α = 1 for Laplace smoothing.
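A small sketch of the smoothed estimate; the counts and the value of alpha below are illustrative:

def laplace_smoothed_prob(count_value_in_class, count_class, n_distinct_values, alpha=1.0):
    # (count of this feature value within the class + alpha) / (class count + alpha * number of distinct values)
    return (count_value_in_class + alpha) / (count_class + alpha * n_distinct_values)

# A feature value never seen with the positive class gets 1 / (N + k) instead of 0
print(laplace_smoothed_prob(0, count_class=10, n_distinct_values=3))   # 1/13 ≈ 0.077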
https://colab.research.google.com/drive/1DX8wOrqeWyNCOBWlN9dQzJL_IccX8wiS?authuser=2
Email spam detection
Logistic Regression: The Logistic model
1. Logistic regression becomes a classification technique only when a decision threshold is brought into
the picture.
2. The setting of the threshold value is a very important aspect of Logistic regression and is dependent
on the classification problem itself.
3. The decision for the value of the threshold value is majorly affected by the values of precision and
recall.
4. Ideally, we want both precision and recall to be 1, but this seldom is the case.
5. Logistic Regression is a “Supervised machine learning” algorithm that can be used to model the
probability of a certain class or event.
6. It is used when the data is linearly separable and the outcome is binary in nature.
7. That means Logistic regression is usually used for Binary classification problems.
The sigmoid function, σ(z) = 1 / (1 + e^(−z)), maps any real value to a value between 0 and 1. In machine learning, we use the sigmoid to map predictions to probabilities.
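A tiny sketch of the sigmoid and a 0.5 decision threshold; the coefficients b0 and b1 below are made-up values for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 0.8                 # made-up coefficients: z = b0 + b1*x
x = np.array([2.0, 5.0, 8.0])
p = sigmoid(b0 + b1 * x)           # probabilities in (0, 1)
y_pred = (p >= 0.5).astype(int)    # apply the decision threshold
print(p, y_pred)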
When to use Logistic Regression?
• Logistic Regression is used when the input needs to be separated into “two regions” by a linear boundary.
• The data points are separated by a straight line (linear decision boundary), as shown:
• Based on the number of categories, Logistic regression can be classified as:
• binomial: target variable can have only 2 possible types: “0” or “1” which
may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
• multinomial: target variable can have 3 or more possible types which are
not ordered(i.e. types have no quantitative significance) like “disease A” vs
“disease B” vs “disease C”.
• ordinal: it deals with target variables with ordered categories. For example,
a test score can be categorized as:“very poor”, “poor”, “good”, “very good”.
Here, each category can be given a score like 0, 1, 2, 3.
Interpretation of Regression Coefficients
Odds of success
• Odds (𝜃) = Probability of an event happening / Probability of the event not happening
• 𝜃 = p / (1 − p)
• The values of the odds range from 0 to ∞, while the values of a probability lie between 0 and 1.
• Consider the equation of a straight line:
𝑦 = 𝛽0 + 𝛽1 * 𝑥
• We transform the model from linear regression to logistic regression by applying the logistic (sigmoid) function to this linear combination.
• Now, to predict the odds of success, we use the following formula:
log(odds) = log(p / (1 − p)) = 𝛽0 + 𝛽1 * 𝑥, which gives p = 1 / (1 + e^−(𝛽0 + 𝛽1 * 𝑥))
# Making predictions with a fitted LogisticRegression model (logreg) on the test set
y_pred = logreg.predict(X_test)
# import the metrics class and compute the confusion matrix
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
# array([[115,   8],
#        [ 30,  39]])
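For context, a minimal end-to-end sketch that would produce objects like logreg, X_test and y_test used above; the synthetic dataset and the split are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Illustrative synthetic binary-classification data
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)                    # estimate the regression coefficients

y_pred = logreg.predict(X_test)                 # make predictions
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.accuracy_score(y_test, y_pred))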
Example
• The dataset of pass/fail in an exam for 5 students is given in the table below. Suppose we use Logistic Regression as the classifier, and the model suggested by the optimizer for the odds of passing the course is:
• log(Odds) = −64 + 2 × hours
• 1) How do we calculate the probability of passing for a student who studied 33 hours?
• 2) At least how many hours should the student study to be sure of passing the course with a probability of more than 95%?
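A sketch of how both questions can be worked out from log(Odds) = −64 + 2 × hours, using the log-odds/probability relationship above:

import math

# 1) Probability of passing after 33 hours of study
log_odds = -64 + 2 * 33                  # = 2
p = 1 / (1 + math.exp(-log_odds))
print(round(p, 3))                       # ≈ 0.881

# 2) Hours needed so that P(pass) > 0.95
#    p > 0.95  =>  odds > 0.95 / 0.05 = 19  =>  log(odds) > ln(19)
hours = (math.log(19) + 64) / 2
print(round(hours, 2))                   # ≈ 33.47 hours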