Naive Bayes

The document discusses overfitting and regularization techniques. It states that overfitting occurs when a model learns noise in addition to information. Regularization addresses overfitting by adding a penalty that increases with model complexity. Ridge and lasso regression are regularization techniques that can be used when there are many features compared to observations; they reduce model complexity and prevent the over-fitting that may result from simple linear regression. Naive Bayes classification is also discussed as an application of Bayes' theorem to classification tasks, based on the probabilistic assumption of independence between features.


Overfitting & Regularization

Overfitting: the model learns the noise in the data in addition to the underlying information.
Common ways to address overfitting:
• cross-validation sampling
• reducing the number of features
• pruning
• Regularization: adds a penalty that grows as model complexity increases.


• When a dataset has a large number of features compared to the number of
observations, the regularization (shrinkage) techniques used to address
over-fitting and perform feature selection include:
» L2 – Ridge regression
» L1 – Lasso regression
• Ridge and Lasso regression are simple techniques to reduce model complexity
and prevent the over-fitting that may result from simple linear regression.



Ridge Regression (L2)

The regularization parameter λ penalizes all the coefficients except the intercept, so that the model generalizes to the data and does not overfit.
It tends to solve the multicollinearity problem through the shrinkage parameter λ.
Lasso Regression (L1)

 Lasso (Least Absolute Shrinkage and Selection Operator) penalizes the absolute size of the regression coefficients.
 In addition to this, it is quite capable of reducing the variability and improving the accuracy of linear regression models.
 Helps in dimensionality reduction and feature selection (see the sketch below).
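As an illustration of ridge and lasso in practice (not from the original slides): a minimal scikit-learn sketch on synthetic data with more features than observations; the alpha values are arbitrary assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# 50 observations, 100 features: more features than observations.
X, y = make_regression(n_samples=50, n_features=100, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: drives many coefficients exactly to zero
print("non-zero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
print("non-zero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))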
• Traditional methods such as cross-validation and stepwise regression for
handling overfitting and performing feature selection work well with a
small set of features, but Ridge and Lasso regularization techniques are a
great alternative when we are dealing with a large set of features.



UNIT-III & II
Classification
A classifier is a machine learning model that is used to discriminate between different objects based on certain features. Examples of classifiers:
• Naïve Bayes
• Logistic Regression
• k-Nearest Neighbor (kNN)
• Decision Tree



Naïve Bayes..

Posterior = (Likelihood × Prior) / Evidence

 A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks.

 The principle of the classifier is based on Bayes' theorem.

 It has a wide range of applications such as spam filtering, sentiment analysis, and document classification.



Naïve Bayes..
• Using Bayes' theorem, we can find the probability of A happening, given that B has occurred.
• Here, B is the evidence and A is the hypothesis.
Naïve assumptions:
 Predictors/features/attributes are independent of each other.
 All the predictors have an equal effect on the outcome.



The variable y is the class variable and X represents the
parameters/features.
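The equation this slide refers to is standard Bayes' rule (written out here, since the original figure is not reproduced in the text):

P(y | X1, ⋯, Xn) = [ P(X1, ⋯, Xn | y) × P(y) ] / P(X1, ⋯, Xn)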



• P(y | X1, ⋯, Xn) is called the posterior and is the probability that an observation is of class y given the observation's values for the n features X1, ⋯, Xn.
• P(X1, ⋯, Xn | y) is called the likelihood and is the likelihood of an observation's values for the features X1, ⋯, Xn, given its class y.
• P(y) is called the prior and is our belief in the probability of class y before looking at the data.
• P(X1, ⋯, Xn) is called the marginal probability (the evidence).
• For all entries in the dataset, the denominator of the equation does not change; it remains constant.

• Therefore, the denominator can be removed and a proportionality introduced, as in the equation below.

• For each observation, the class with the greatest posterior numerator becomes the predicted class.

• In other words, we obtain the class with maximum probability given the predictors.
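Reconstructing the equation the slide refers to (the original figure is not reproduced in the text), the naïve independence assumption gives

P(y | X1, ⋯, Xn) ∝ P(y) × P(X1 | y) × ⋯ × P(Xn | y)

and the predicted class is the y that maximizes the right-hand side.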
Working steps-
Step 1- Compute the prior probabilities for the given class labels.
Step 2- Compute the likelihood of the evidence with each attribute for each class.
Step 3- Calculate the posterior probabilities using Bayes' rule.
Step 4- Select the class which has the higher probability for the given inputs.
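A minimal Python sketch of these four steps for categorical features (an illustration, not code from the original slides; the function and variable names are my own):

# Naive Bayes for categorical features, following the four steps above.
def naive_bayes_predict(X_train, y_train, x_new):
    n = len(y_train)
    classes = sorted(set(y_train))

    # Step 1: prior probability of each class label.
    priors = {c: sum(1 for y in y_train if y == c) / n for c in classes}

    # Step 2: likelihood of each attribute value of x_new given each class.
    likelihoods = {c: [] for c in classes}
    for c in classes:
        rows = [X_train[i] for i in range(n) if y_train[i] == c]
        for j, value in enumerate(x_new):
            count = sum(1 for row in rows if row[j] == value)
            likelihoods[c].append(count / len(rows))

    # Step 3: posterior numerators via Bayes' rule (common denominator ignored).
    scores = {c: priors[c] for c in classes}
    for c in classes:
        for p in likelihoods[c]:
            scores[c] *= p

    # Step 4: select the class with the higher probability.
    return max(scores, key=scores.get), scores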



Use case..
Example-
• Consider the problem of predicting whether a car racing match will be held (yes/no) based on parameters such as cloudy scene (the weather condition), temperature, moisture on track, and turbulent.
• The historical data from some previous matches are given in the table.
• Using the given data, we need to predict the car race match on a day with values Sunshine, Chill, High, and False for the parameters cloudy scene, temperature, moisture, and turbulent, respectively.
S.no Cloudy scene Temperature Moisture on track Turbulent Car Race
1 Hazy Heated High False Yes
2 Hazy Chill Normal True Yes
3 Hazy Moderate High True Yes
4 Hazy Heated Normal False Yes
5 Rainfall Moderate High False Yes
6 Rainfall Chill Normal False Yes
7 Rainfall Chill Normal True No
8 Rainfall Moderate Normal False Yes
9 Rainfall Moderate High True No
10 Sunshine Heated High False No
11 Sunshine Heated High True No
12 Sunshine Moderate High False No
13 Sunshine Chill Normal False Yes
14 Sunshine Moderate Normal True Yes



Let us assume x1 = Sunshine, x2 = Chill, x3 = High and x4 = False.
Step 1- Calculate the prior probabilities:
 P(CarRace = Yes) = 9/14 and
 P(CarRace = No) = 5/14
Step 2- Compute the likelihood of each piece of evidence given each class:
 P(Sunshine | Yes) = P(Sunshine ∩ Yes) / P(Yes) = (2/14)/(9/14) = 2/9
 P(Chill | Yes) = P(Chill ∩ Yes) / P(Yes) = (3/14)/(9/14) = 3/9
 P(High | Yes) = P(High ∩ Yes) / P(Yes) = (3/14)/(9/14) = 3/9
 P(False | Yes) = P(False ∩ Yes) / P(Yes) = (6/14)/(9/14) = 6/9
 P(Sunshine | No) = P(Sunshine ∩ No) / P(No) = (3/14)/(5/14) = 3/5
 P(Chill | No) = P(Chill ∩ No) / P(No) = (1/14)/(5/14) = 1/5
 P(High | No) = P(High ∩ No) / P(No) = (4/14)/(5/14) = 4/5
 P(False | No) = P(False ∩ No) / P(No) = (2/14)/(5/14) = 2/5



Step 3- Calculate the posterior probabilities using Bayes' rule:
• P(Yes | Sunshine, Chill, High, False) =
  [P(Sunshine, Chill, High, False | Yes) * P(Yes)] / P(Sunshine, Chill, High, False)
• P(No | Sunshine, Chill, High, False) =
  [P(Sunshine, Chill, High, False | No) * P(No)] / P(Sunshine, Chill, High, False)
• Ignoring the denominator term, as it is common to both equations, and applying the naïve assumption of independent features, the posterior numerators are:



• P(Sunshine, Chill, High, False | Yes) * P(Yes) =
  P(Sunshine | Yes) * P(Chill | Yes) * P(High | Yes) * P(False | Yes) * P(Yes)
and
• P(Sunshine, Chill, High, False | No) * P(No) =
  P(Sunshine | No) * P(Chill | No) * P(High | No) * P(False | No) * P(No)
• Substituting the values:
  P(Sunshine, Chill, High, False | Yes) * P(Yes) = (2/9)(3/9)(3/9)(6/9)(9/14) ≈ 0.0105
and
  P(Sunshine, Chill, High, False | No) * P(No) = (3/5)(1/5)(4/5)(2/5)(5/14) ≈ 0.0137
Step 4- Finally, the normalized probabilities are:

 P(Yes | Sunshine, Chill, High, False) = 0.0105/(0.0105 + 0.0137) ≈ 0.44
 P(No | Sunshine, Chill, High, False) = 0.0137/(0.0105 + 0.0137) ≈ 0.56
• From these probability values it is clear that if the parameter values are Cloudy Scene = Sunshine, Temperature = Chill, Moisture on track = High and Turbulent = False, then there is only a 44% chance of holding the car race match, so the predicted class is No.
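To double-check the hand computation, the posterior numerators from Step 3 can be evaluated as exact fractions (a small standalone sketch, not part of the original slides):

from fractions import Fraction as F

# Posterior numerators for Yes and No, using the priors and likelihoods counted from the table.
num_yes = F(9, 14) * F(2, 9) * F(3, 9) * F(3, 9) * F(6, 9)
num_no  = F(5, 14) * F(3, 5) * F(1, 5) * F(4, 5) * F(2, 5)

print(round(float(num_yes / (num_yes + num_no)), 2))  # 0.44 -> P(Yes | evidence)
print(round(float(num_no  / (num_yes + num_no)), 2))  # 0.56 -> P(No | evidence)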



How to improve the efficiency of a Naïve Bayes model-
• If continuous features in the dataset are not normally distributed, use a transformation method to make them approximately normal (a sketch follows below).
• Drop highly correlated features.
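A minimal sketch of the first suggestion, assuming scikit-learn's PowerTransformer as the transformation method (log or Box-Cox transforms are alternatives):

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x_skewed = rng.exponential(scale=2.0, size=(200, 1))  # a clearly non-normal continuous feature

pt = PowerTransformer(method="yeo-johnson")           # maps the feature toward a normal shape
x_transformed = pt.fit_transform(x_skewed)            # roughly zero mean, unit variance, more symmetric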
Evaluation of a Naïve Bayes model-
• Since this is a classification model, all the standard classification accuracy metrics can be used to evaluate its performance.
Pros and Cons of the Naïve Bayes Model-
Pros
• The Naïve Bayes model is simple to implement and fast in processing.
• Requires relatively few examples in the training set to work with.
• Performs well with noisy data and missing values.
Cons
• Performs poorly if the dataset contains many continuous input features.
• Predictions are based on the assumption of independent features, which is almost never true in real-life scenarios.
• Sometimes the estimated probabilities are less reliable.
• In spite of their over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, such as document classification and spam filtering.
• They require a small amount of training data to estimate the necessary parameters.
• Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods.
• The different naive Bayes classifiers differ mainly in the assumptions they make regarding the distribution of P(xi | y).
• The decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
• On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba() are not to be taken too seriously.



1. Bernoulli Naïve Bayes:
– for binary features
2. Multinomial Naïve Bayes:
– for integer features
3. Gaussian Naïve Bayes:
– for continuous features
4. Multi-class Naïve Bayes:
– for classification problems with > 2 classes
– event model could be any of Bernoulli,
Gaussian, Multinomial, depending on
features.
Bernoulli Naïve Bayes..
• BernoulliNB implements the naive Bayes training and classification algorithms for datasets in which there may be multiple features, but each one is assumed to be a binary-valued variable.
• Therefore, this class requires samples to be represented as binary-valued feature vectors.
• If handed any other kind of data, a BernoulliNB instance may binarize its input.
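A minimal scikit-learn sketch on a tiny made-up binary dataset (the data itself is an assumption for illustration):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]])  # binary feature vectors
y = np.array([1, 1, 0, 0])                                   # binary labels

clf = BernoulliNB().fit(X, y)
print(clf.predict([[1, 0, 0]]))        # predicted class for a new binary vector
print(clf.predict_proba([[1, 0, 0]]))  # class probabilities (treat with caution, as noted earlier)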



Bernoulli Naïve Bayes:
Data: Binary feature vectors, Binary labels



Gaussian NB..

Gaussian Naive Bayes assumes that p(xi | y) is given by a Normal distribution.
Gaussian NB..

The parameters ’σy’ and ’μy’ are estimated using maximum likelihood.
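The density the slide refers to is the usual Gaussian form (reconstructed here, since the original figure is not reproduced in the text):

P(xi | y) = (1 / sqrt(2π σy²)) × exp( -(xi - μy)² / (2 σy²) )

A minimal scikit-learn sketch on a tiny made-up continuous dataset:

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])  # continuous features
y = np.array([0, 0, 1, 1])

clf = GaussianNB().fit(X, y)
print(clf.theta_)                 # per-class feature means (the mu_y estimates)
print(clf.predict([[1.1, 2.0]]))  # -> class 0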



Evaluation Metrics
• Jaccard index.
• Confusion Matrix
• F-1 Score
• Log loss



Jaccard Index..
• Given predicted values ŷ and actual values y, the Jaccard index can be defined as:
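J(y, ŷ) = |y ∩ ŷ| / ( |y| + |ŷ| - |y ∩ ŷ| )   (reconstructed here since the original figure is not reproduced; |y ∩ ŷ| counts the labels on which the two vectors agree, matching the example below)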

Example: predicted = [0,0,0,0,0,1,1,1,1,1] and actual = [1,1,0,0,0,1,1,1,1,1]

J(y, ŷ) = 8/(10 + 10 - 8) = 8/12 ≈ 0.67

Idea: the higher the similarity between the two label sets, the higher the Jaccard index.
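A minimal sketch computing the index exactly as defined above (note this counts all matching labels; scikit-learn's jaccard_score restricts the intersection and union to the positive class, so it would give a different value on this example):

# Jaccard index as defined above: matches / (len(y) + len(y_hat) - matches).
def jaccard_index(y_true, y_pred):
    matches = sum(1 for a, b in zip(y_true, y_pred) if a == b)
    return matches / (len(y_true) + len(y_pred) - matches)

actual    = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(round(jaccard_index(actual, predicted), 2))  # 0.67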



Confusion Matrix..
• The confusion matrix is used to describe the
performance of a classification model on a set
of test data for which true values are known.
• For a classification problem with a categorical class output, the confusion matrix gives the counts of correct and erroneous predictions.



True Positive (TP): the model correctly predicted a Positive case as Positive, e.g. an illness is diagnosed as present and truly is present.
False Positive (FP): the model incorrectly predicted a Negative case as Positive, e.g. an illness is diagnosed as present but is in fact absent. (Type I error)
False Negative (FN): the model incorrectly predicted a Positive case as Negative, e.g. an illness is diagnosed as absent but is in fact present. (Type II error)
True Negative (TN): the model correctly predicted a Negative case as Negative, e.g. an illness is diagnosed as absent and truly is absent.



Parameters..
• Classification error rate: the fraction of Type 1 (FP) and Type 2 (FN) errors among all predictions. Accuracy = 1 - (error rate).
• Sensitivity (also called Recall or True Positive Rate): proportion of total positives that were correctly identified: TP/(TP+FN).
• Specificity (also called True Negative Rate): proportion of total negatives that were correctly identified: TN/(TN+FP).
• Precision = TP/(TP+FP): of the cases predicted positive, how many are actually positive? This should be as high as possible.
• Recall = TP/(TP+FN): of the cases that are actually positive, how many did the model predict as positive?
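A minimal sketch computing the confusion matrix and the metrics above with scikit-learn, using a small made-up pair of label vectors:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                                   # 5 0 2 3
print("accuracy   :", (tp + tn) / (tp + tn + fp + fn))  # 0.8
print("sensitivity:", tp / (tp + fn))                   # recall / TPR = 5/7
print("specificity:", tn / (tn + fp))                   # TNR = 1.0
print("precision  :", precision_score(y_true, y_pred))  # 1.0
print("recall     :", recall_score(y_true, y_pred))     # 5/7 ≈ 0.714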



F1-Score..
• The F1 score is calculated from the precision and recall of each class. It is the harmonic mean of the precision and recall scores.
• The F1 score reaches its best value at 1 and its worst at 0. It is a good way to show that a classifier has both good recall and good precision.

• F1-Score = 2 * [(precision * recall) / (precision + recall)]
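Continuing the confusion-matrix sketch above (precision = 1.0, recall = 5/7), the formula can be checked against scikit-learn directly:

from sklearn.metrics import f1_score

y_true = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

precision, recall = 1.0, 5 / 7
print(2 * (precision * recall) / (precision + recall))  # ≈ 0.833, from the formula above
print(f1_score(y_true, y_pred))                         # same value via scikit-learn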
