This document discusses overfitting and regularization techniques. Overfitting occurs when a model learns noise in addition to the underlying information. Regularization addresses this by adding a penalty that grows with model complexity. Ridge and lasso regression are regularization techniques suited to datasets with many features relative to the number of observations; they reduce model complexity and prevent the overfitting that plain linear regression can suffer from. The document then covers Naive Bayes classification, which applies Bayes' theorem under the assumption that features are independent of one another.
Naive Bayes
Overfitting & Regularization
Overfitting: the model learns the noise in the training data in addition to the underlying information. Common ways to address it:
• cross-validation sampling
• reducing the number of features
• pruning
• Regularization: adds a penalty that grows as model complexity increases.
• When a dataset has a large number of features compared to the number of observations, regularization (shrinkage) techniques can be used to address over-fitting and perform feature selection, notably:
  » L2 – Ridge regression
  » L1 – Lasso regression
• Ridge and Lasso regression are simple techniques for reducing model complexity and preventing the over-fitting that may result from plain linear regression.
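For reference, the two penalized least-squares objectives take the following standard form (these formulas are a reconstruction and are not reproduced in the slides); note that the intercept is not penalized:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda \sum_{j=1}^{p}\beta_j^{2}$$

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert$$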
Ridge Regression (L2)
The regularization parameter λ penalizes all the coefficients except the intercept, so the model generalizes to the data rather than overfitting it. Through the shrinkage parameter λ it also helps mitigate the multicollinearity problem.

Lasso Regression (L1)
Lasso (Least Absolute Shrinkage and Selection Operator) penalizes the absolute size of the regression coefficients. It can reduce variance and improve the accuracy of linear regression models, and because it can shrink coefficients exactly to zero it helps with dimensionality reduction and feature selection.
• Traditional methods for handling overfitting and performing feature selection, such as cross-validation and stepwise regression, work well with a small set of features, but Ridge and Lasso regularization are a great alternative when dealing with a large set of features.
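A minimal sketch of fitting both models with scikit-learn, assuming a NumPy feature matrix X and target y; the data and the chosen alpha values are illustrative, not taken from the slides:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

# Illustrative data: many features relative to the number of observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(scale=0.5, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha is scikit-learn's name for the shrinkage parameter lambda.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # L1 penalty: can zero coefficients out

print("Ridge R^2:", ridge.score(X_test, y_test))
print("Lasso R^2:", lasso.score(X_test, y_test))
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))
```

The number of non-zero Lasso coefficients illustrates the feature-selection behaviour described above.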
UNIT-III & II: Classification
A classifier is a machine learning model that is used to discriminate between different objects based on certain features. Common classifiers:
• Naïve Bayes
• Logistic Regression
• k-Nearest Neighbor (kNN)
• Decision Tree
Naïve Bayes
Posterior = (Likelihood × Prior) / Evidence
A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The principle of the classifier is based on Bayes' theorem. It has a wide range of applications, such as spam filtering, sentiment analysis and document classification.
Naïve Bayes
• Using Bayes' theorem, we can find the probability of A happening, given that B has occurred (see the formula below).
• Here, B is the evidence and A is the hypothesis.
Naïve assumptions:
• Predictors/features/attributes are independent of each other.
• All the predictors have an equal effect on the outcome.
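For reference, Bayes' theorem in this notation (the slide's equation image is not reproduced in the text; this is the standard form):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$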
The variable y is the class variable and X represents the parameters/features.
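The slide's equation is not included in the text; under the naive independence assumption, the posterior it describes takes the standard form:

$$P(y \mid X_1,\dots,X_n) = \frac{P(y)\,\prod_{i=1}^{n} P(X_i \mid y)}{P(X_1,\dots,X_n)}$$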
• P(y | X1, ⋯, Xn) is called the posterior and is the probability that an observation belongs to class y given the observation's values for the n features X1, ⋯, Xn.
• P(X1, ⋯, Xn | y) is called the likelihood and is the likelihood of an observation's feature values X1, ⋯, Xn given its class y.
• P(y) is called the prior and is our belief in the probability of class y before looking at the data.
• P(X1, ⋯, Xn) is called the marginal probability (the evidence).
• For all entries in the dataset, the denominator of the equation does not change; it remains constant.
• Therefore, the denominator can be dropped and a proportionality introduced:
  P(y | X1, ⋯, Xn) ∝ P(y) · P(X1 | y) · ⋯ · P(Xn | y)
• For each observation, the class with the greatest posterior numerator becomes the predicted class, i.e. we pick the class with maximum probability given the predictors.

Working steps:
Step 1 – Compute the prior probabilities for the given class labels.
Step 2 – Compute the likelihood of the evidence for each attribute, for each class.
Step 3 – Calculate the posterior probabilities using Bayes' rule.
Step 4 – Select the class with the higher probability for the given inputs.
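These steps amount to the standard maximum a posteriori decision rule (written here as a compact formula; the slides state it only in words):

$$\hat{y} = \arg\max_{y}\; P(y)\prod_{i=1}^{n} P(X_i \mid y)$$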
Use case example:
• Let us consider the problem of predicting whether a car racing match takes place (yes/no) based on parameters such as Cloudy Scene, Temperature, Moisture on Track and Turbulent.
• The historical data of some previous matches are given in the table below.
• Using the given data, we need to predict the car race match on a day with Sunshine, Chill, High and False as the values of Cloudy Scene, Temperature, Moisture on Track and Turbulent respectively.

S.No  Cloudy Scene  Temperature  Moisture on Track  Turbulent  Car Race
 1    Hazy          Heated       High               False      Yes
 2    Hazy          Chill        Normal             True       Yes
 3    Hazy          Moderate     High               True       Yes
 4    Hazy          Heated       Normal             False      Yes
 5    Rainfall      Moderate     High               False      Yes
 6    Rainfall      Chill        Normal             False      Yes
 7    Rainfall      Chill        Normal             True       No
 8    Rainfall      Moderate     Normal             False      Yes
 9    Rainfall      Moderate     High               True       No
10    Sunshine      Heated       High               False      No
11    Sunshine      Heated       High               True       No
12    Sunshine      Moderate     High               False      No
13    Sunshine      Chill        Normal             False      Yes
14    Sunshine      Moderate     Normal             True       Yes
Let us assume x1 = Sunshine, x2 = Chill, x3 = High and x4 = False.
Step 1 – Calculate the prior probabilities:
P(CarRace = Yes) = 9/14 and P(CarRace = No) = 5/14
Step 2 – Compute the likelihood of each piece of evidence for each class:
P(Sunshine | Yes) = P(Sunshine ∩ Yes) / P(Yes) = (2/14)/(9/14) = 2/9
P(Chill | Yes) = P(Chill ∩ Yes) / P(Yes) = (3/14)/(9/14) = 3/9
P(High | Yes) = P(High ∩ Yes) / P(Yes) = (3/14)/(9/14) = 3/9
P(False | Yes) = P(False ∩ Yes) / P(Yes) = (6/14)/(9/14) = 6/9
P(Sunshine | No) = P(Sunshine ∩ No) / P(No) = (3/14)/(5/14) = 3/5
P(Chill | No) = P(Chill ∩ No) / P(No) = (1/14)/(5/14) = 1/5
P(High | No) = P(High ∩ No) / P(No) = (4/14)/(5/14) = 4/5
P(False | No) = P(False ∩ No) / P(No) = (2/14)/(5/14) = 2/5
Step 3 – Calculate the posterior probabilities using Bayes' rule:
• P(Yes | Sunshine, Chill, High, False) = [P(Sunshine, Chill, High, False | Yes) · P(Yes)] / P(Sunshine, Chill, High, False)
• P(No | Sunshine, Chill, High, False) = [P(Sunshine, Chill, High, False | No) · P(No)] / P(Sunshine, Chill, High, False)
• Ignoring the denominator, since it is common to both equations, and applying the naïve assumption of independent features, the posterior numerators are:
• P(Yes | Sunshine, Chill, High, False) ∝ P(Sunshine | Yes) · P(Chill | Yes) · P(High | Yes) · P(False | Yes) · P(Yes) and
• P(No | Sunshine, Chill, High, False) ∝ P(Sunshine | No) · P(Chill | No) · P(High | No) · P(False | No) · P(No)
• Substituting the values:
Numerator for Yes = (2/9)(3/9)(3/9)(6/9)(9/14) ≈ 0.0106 and
Numerator for No = (3/5)(1/5)(4/5)(2/5)(5/14) ≈ 0.0137

Step 4 – Finally, normalising the numerators gives the probabilities:
P(Yes | Sunshine, Chill, High, False) = 0.0106 / (0.0106 + 0.0137) ≈ 0.44
P(No | Sunshine, Chill, High, False) = 0.0137 / (0.0106 + 0.0137) ≈ 0.56
• From these values it is clear that if Cloudy Scene = Sunshine, Temperature = Chill, Moisture on Track = High and Turbulent = False, then there is only a 44% chance that the car race match takes place, so the predicted class is No.
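A minimal Python sketch of the same computation, counting directly from the table above (the data encoding and helper names here are for illustration and are not part of the slides):

```python
from collections import Counter

# Historical data from the table: (cloudy_scene, temperature, moisture, turbulent, car_race)
data = [
    ("Hazy", "Heated", "High", "False", "Yes"), ("Hazy", "Chill", "Normal", "True", "Yes"),
    ("Hazy", "Moderate", "High", "True", "Yes"), ("Hazy", "Heated", "Normal", "False", "Yes"),
    ("Rainfall", "Moderate", "High", "False", "Yes"), ("Rainfall", "Chill", "Normal", "False", "Yes"),
    ("Rainfall", "Chill", "Normal", "True", "No"), ("Rainfall", "Moderate", "Normal", "False", "Yes"),
    ("Rainfall", "Moderate", "High", "True", "No"), ("Sunshine", "Heated", "High", "False", "No"),
    ("Sunshine", "Heated", "High", "True", "No"), ("Sunshine", "Moderate", "High", "False", "No"),
    ("Sunshine", "Chill", "Normal", "False", "Yes"), ("Sunshine", "Moderate", "Normal", "True", "Yes"),
]

query = ("Sunshine", "Chill", "High", "False")
class_counts = Counter(row[-1] for row in data)

def posterior_numerator(label):
    """Prior times the product of per-feature likelihoods for one class."""
    rows = [row for row in data if row[-1] == label]
    prior = class_counts[label] / len(data)
    likelihood = 1.0
    for i, value in enumerate(query):
        likelihood *= sum(1 for row in rows if row[i] == value) / len(rows)
    return prior * likelihood

numerators = {label: posterior_numerator(label) for label in class_counts}
total = sum(numerators.values())
for label, num in numerators.items():
    print(label, round(num / total, 2))   # Yes -> 0.44, No -> 0.56
```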
How to improve the efficiency of a naïve Bayes model:
• If continuous features are not normally distributed in the dataset, apply a transformation to make them approximately normal.
• Drop highly correlated features.
Evaluation of a naïve Bayes model:
• This is a classification model, so all the classification accuracy metrics are used to evaluate its performance.

Pros and cons of the naïve Bayes model:
Pros
• Simple to implement and fast in processing.
• Requires few examples in the training set to work with.
• Performs well with noisy data and missing values.
Cons
• Performs poorly if the dataset contains many continuous input features.
• Predictions are based on the assumption of independent features, which is almost never true in real-life scenarios.
• Sometimes the estimated probabilities are less reliable.

• In spite of their over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, such as document classification and spam filtering.
• They require a small amount of training data to estimate the necessary parameters.
• Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods.
• The different naive Bayes classifiers differ mainly in the assumptions they make about the distribution of P(xi | y).
• The decoupling of the class-conditional feature distributions means that each distribution can be estimated independently as a one-dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.
• On the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba() are not to be taken too seriously.
Naïve Bayes variants:
1. Bernoulli Naïve Bayes – for binary features
2. Multinomial Naïve Bayes – for integer (count) features
3. Gaussian Naïve Bayes – for continuous features
4. Multi-class Naïve Bayes – for classification problems with more than 2 classes; the event model can be any of Bernoulli, Gaussian or Multinomial, depending on the features.

Bernoulli Naïve Bayes:
• Bernoulli NB implements the naive Bayes training and classification algorithms for datasets in which there may be multiple features, but each one is assumed to be a binary-valued variable.
• Therefore, this class requires samples to be represented as binary-valued feature vectors.
• If handed any other kind of data, a BernoulliNB instance may binarize its input.
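A minimal scikit-learn sketch showing two of these variants (the toy data here is invented for illustration; it is not from the slides):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Binary features -> Bernoulli naive Bayes.
X_bin = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
y_bin = np.array([1, 1, 0, 0])
bnb = BernoulliNB().fit(X_bin, y_bin)
print(bnb.predict([[1, 0, 0]]))          # predicted class for a new binary vector

# Continuous features -> Gaussian naive Bayes.
X_cont = np.array([[1.2, 3.4], [1.0, 3.1], [5.6, 0.2], [6.0, 0.4]])
y_cont = np.array([0, 0, 1, 1])
gnb = GaussianNB().fit(X_cont, y_cont)
print(gnb.predict([[1.1, 3.0]]))         # -> class 0
print(gnb.predict_proba([[1.1, 3.0]]))   # class probabilities (use with caution, see above)
```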
Jaccard Index:
• Given predicted values ŷ and actual values y, the Jaccard index can be defined as:
  J(y, ŷ) = |y ∩ ŷ| / (|y| + |ŷ| − |y ∩ ŷ|)
  where |y ∩ ŷ| is the number of positions where the predicted and actual labels agree.
• Example: predicted = [0,0,0,0,0,1,1,1,1,1] and actual = [1,1,0,0,0,1,1,1,1,1]
  The labels agree in 8 of the 10 positions, so J(y, ŷ) = 8 / (10 + 10 − 8) = 8/12 ≈ 0.67
• Idea: the higher the similarity between the two groups, the higher the Jaccard index.
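A tiny sketch of this index exactly as defined above, in plain Python (the function name is illustrative, not from the slides):

```python
def jaccard_index(y_true, y_pred):
    """Jaccard index as defined above: matches / (len + len - matches)."""
    matches = sum(1 for a, p in zip(y_true, y_pred) if a == p)
    return matches / (len(y_true) + len(y_pred) - matches)

actual    = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(round(jaccard_index(actual, predicted), 2))  # -> 0.67 (i.e. 8/12)
```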
Confusion Matrix:
• The confusion matrix is used to describe the performance of a classification model on a set of test data for which the true values are known.
• For a classification problem with a class output, the confusion matrix gives the counts of correct and erroneous predictions.
True Positive (TP): the model correctly predicted a positive case as positive, e.g. an illness is diagnosed as present and truly is present.
False Positive (FP): the model incorrectly predicted a negative case as positive, e.g. an illness is diagnosed as present but in reality is absent (Type I error).
False Negative (FN): the model incorrectly predicted a positive case as negative, e.g. an illness is diagnosed as absent but in reality is present (Type II error).
True Negative (TN): the model correctly predicted a negative case as negative, e.g. an illness is diagnosed as absent and truly is absent.
Parameters:
• Classification error rate: the sum of Type I (FP) and Type II (FN) errors as a percentage of all predictions. Accuracy = 1 − (error rate).
• Sensitivity (also called Recall or True Positive Rate): the proportion of total positives that were correctly identified: TP / (TP + FN).
• Specificity (also called True Negative Rate): the proportion of total negatives that were correctly identified: TN / (TN + FP).
• Precision = TP / (TP + FP): of the cases predicted positive, how many were actually positive? This should be as high as possible.
• Recall = TP / (TP + FN): the true positive rate, i.e. of the actual positives, how often does the model predict positive?
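A minimal scikit-learn sketch computing these quantities for a small made-up set of predictions (the label vectors are illustrative, not from the slides):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

y_true = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)

print("Accuracy   :", accuracy_score(y_true, y_pred))    # 1 - error rate
print("Precision  :", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall     :", recall_score(y_true, y_pred))      # TP / (TP + FN), sensitivity
print("Specificity:", tn / (tn + fp))                    # TN / (TN + FP)
```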
F1-Score:
• The F1 score is calculated from the precision and recall of each class. It is the harmonic mean of the precision and recall scores.
• The F1 score reaches its best value at 1 and its worst at 0. It is a good single measure of whether a classifier has both good recall and good precision.
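For reference, the formula (a standard definition, not reproduced on the slide):

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$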