Unit IV

The document discusses various types of data used in machine learning, including training, validation, and test data, and emphasizes the importance of data quality, quantity, and diversity. It outlines different validation techniques such as hold-out, k-fold, and stratified k-fold cross-validation, along with their advantages and disadvantages. Additionally, it introduces bootstrapping as a resampling method for improving model accuracy and mentions Support Vector Machines as a supervised learning algorithm.




Data and DataSet
 Training data. This type of data builds up the machine learning algorithm.
The data scientist feeds the algorithm input data, which corresponds to an
expected output. The model evaluates the data repeatedly to learn more
about the data’s behavior and then adjusts itself to serve its intended
purpose.

 Validation data. During training, validation data infuses new data into the
model that it hasn’t evaluated before. Validation data provides the first test
against unseen data, allowing data scientists to evaluate how well the model
makes predictions based on the new data. Not all data scientists use
validation data, but it can provide some helpful information to optimize
hyperparameters, which influence how the model assesses data.

 Test data. After the model is built, testing data once again validates that it
can make accurate predictions. If training and validation data include labels
to monitor performance metrics of the model, the testing data should be
unlabeled. Test data provides a final, real-world check of an unseen dataset
to confirm that the ML algorithm was trained effectively.
Data and DataSet

There is some semantic ambiguity between validation data and testing data. Some
organizations call testing datasets “validation datasets.” Ultimately, if there are three
datasets to tune and check ML algorithms, validation data typically helps tune the
algorithm and testing data provides the final assessment.

Random noise (i.e. data points that make it difficult to see a pattern), low frequency of a certain categorical variable, low frequency of the target category (if the target variable is categorical), and incorrect numeric values are just some of the ways data can mess up a model.
Data and DataSet
 An ML algorithm is only as good as its training data — as the saying goes,
“garbage in, garbage out." Effective ML training data is built upon three key
components:
 Quantity. A robust ML algorithm needs lots of training data to
properly learn how to interact with users and behave within the
application.
 Quality. Volume alone will only take your ML algorithm so far. The
quality of the data is just as important. This means collecting real-
world data, such as voice utterances, images, videos, documents,
sounds and other forms of input on which your algorithm might rely.
 Diversity. The third piece of the pie is diversity of data, which is
essential to eliminate the dreaded problem of AI bias, where the
application works better for a certain segment of the population
than others. With AI bias, the ML algorithm delivers results that can
be seen as prejudiced against a certain gender, race, age group,
language or culture, depending on how it manifests.
Validation
 Developing a machine learning model is not enough; before relying on its predictions, we need to check its accuracy and validate it so that its results are dependable in real-life applications.
 Choosing the right validation method is especially important to ensure that the validation process itself is accurate and unbiased.
 In machine learning, several candidate models are typically built and compared to make algorithms work and support artificial-intelligence applications.
 Machine learning models are not always stable, so we have to evaluate how stable a model's performance is. That is where cross-validation comes into the picture.

Types of Validation Technique
 Validation techniques in machine learning are used to estimate the error rate of the ML model, which can be taken as being close to the true error rate on the population.
 If the data volume is large enough to be representative of the population, you may not need validation techniques. However, in real-world scenarios, we work with samples of data that may not be truly representative of the population. This is where validation techniques come into the picture.
 Different validation techniques:
◦ Non-exhaustive techniques
 Hold-out
 K-fold cross-validation
 Stratified k-fold cross validation
◦ Exhaustive techniques
 LOOCV
 Leave P out CV
 Bootstrapping

Hold-out Validation Method
 In this method, we randomly divide our data into two: Training
and Test/Validation set i.e. a hold-out set. We then train the
model on the training dataset and evaluate the model on the
Test/Validation dataset.
 The evaluation technique used on the validation dataset to compute the error depends on the kind of problem we are working with: MSE is typically used for regression problems, while metrics based on the misclassification rate are used to find the error for classification problems.
 Typically the training dataset is bigger than the hold-out dataset. Typical ratios used for splitting the dataset include 60:40, 80:20, etc.
 This method is typically used when we have only one model to evaluate and no hyperparameters to tune. A minimal sketch of such a split follows.
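As a rough illustration, the sketch below performs an 80:20 hold-out split with scikit-learn. The iris dataset and the logistic-regression model are placeholder choices made only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 80:20 split; the 20% hold-out set is never seen during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```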
Hold-out Validation Method

• The limitation of such a method is that the error measured on the test dataset can depend heavily on which observations happen to fall in the train and test sets.
• Also, if the train or test dataset is not representative of the complete data, the results from the test set can be skewed.

Hold-out Validation Method
 This method is not effective for comparing multiple models and tuning their hyperparameters, which leads us to another very popular form of the hold-out method: splitting the data into not two, but three separate sets (training, validation and test), as sketched below.
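A minimal sketch of such a three-way split, obtained by calling scikit-learn's train_test_split twice. The roughly 60:20:20 ratio is only an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off 20% as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then split the remainder into training (60% overall) and validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

# Tune hyperparameters on (X_val, y_val); report the final error on (X_test, y_test) only once.
```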

Hold-out Validation Method
 When we tune the hyperparameters based on the validation set, we end up slightly overfitting our model to the validation set.
 The accuracy we obtain on the validation set is therefore not considered final; another hold-out dataset, the test dataset, is used to evaluate the final selected model, and the error found there is considered the generalization error.
 The hold-out method alone is not enough; we need a more advanced validation technique that is less biased and can save the model from overfitting. One such technique is k-fold cross-validation.

k Fold Validation Method
 In this method, the dataset is broken into k folds, where one fold is used as the test set and the remaining folds are used as the training set; this is repeated k times so that each fold serves as the test set once, with k specified by the user.
 In a regression problem, the average of the per-fold results (e.g. RMSE, R-squared) is used as the final result.
 In k-fold cross-validation, we make the assumption that all observations in the dataset are distributed in a way that the data are not biased.
 We randomly divide the dataset into k equal-sized parts, leave out one part, and fit the model on the other k−1 parts combined.

NEERAJ KHARYA, DEPARTMENT OF


COMPUTER APPLICATIONS, BIT DURG 13
k Fold Validation Method
 The model is evaluated on the left-out part, and this process is repeated k times so that each part is used as the testing set. The results from each fold are then combined and averaged to arrive at the final error.
 The advantage is that the entire data is used for both training and testing. The error rate of the model is the average of the error rates of the individual iterations. This technique can also be seen as a form of the repeated hold-out method.
 The procedure has a single parameter called k that refers to
the number of groups that a given data sample is to be split
into. As such, the procedure is often called k-fold cross-
validation.
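A small sketch of 5-fold cross-validation with scikit-learn; the iris dataset and logistic-regression model are placeholders used only to make the example runnable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Each of the k = 5 folds is used once as the test set; the scores are averaged.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```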

k fold Validation Method
Advantages
 Checking model generalization: Cross-validation gives an idea of how the model will generalize to an unknown dataset.
 Checking model performance: Cross-validation helps determine a more accurate estimate of model prediction performance.
 Checking overfitting: Cross-validation can be used to check whether the model has been overfitted.
 Hyperparameter tuning: Cross-validation can be used to select the best set of hyperparameters (a sketch follows below).
Disadvantages
 Higher training time: With cross-validation, we need to train the model on multiple training sets.
 Expensive computation: Cross-validation is computationally expensive because we need to train on multiple training sets.
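Since hyperparameter tuning via cross-validation is listed above, here is a hedged sketch using scikit-learn's GridSearchCV; the SVC model and the parameter grid are illustrative assumptions only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Each candidate (C, kernel) pair is scored with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```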
Random Vs Stratified Sampling
 Suppose you want to take a survey and decide to call 1000 people from a particular state. If you pick 1000 males, or 1000 females, or, say, 900 females and 100 males at random, and ask their opinion on a particular product, then based on these 1000 opinions you cannot judge the opinion of the entire state on your product. This is random sampling.
 In stratified sampling, suppose the population of that state is 52% male and 48% female. Then, to choose 1000 people from that state, you pick 520 males (52% of 1000) and 480 females (48% of 1000), i.e. 520 males + 480 females (total = 1000 people), and ask their opinion. These groups of people then represent the entire state. This is called stratified sampling.

Stratified k fold Validation Method
 Stratified k-fold cross-validation is the same as plain k-fold cross-validation, except that it uses stratified sampling instead of random sampling when forming the folds.
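A brief sketch using scikit-learn's StratifiedKFold; the breast-cancer dataset and logistic-regression classifier are placeholders chosen only so the example runs.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Each fold preserves (approximately) the class proportions of y.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
print("Mean stratified-CV accuracy:", scores.mean())
```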

Stratified k fold Validation Method
 Cross-validation implemented using stratified sampling ensures
that the proportion of the feature of interest is the same
across the original data, training set and the test set.
 This ensures that no value is over/under-represented in the
training and test sets, which gives a more accurate estimate of
performance/error.

Leave-P-Out cross validation / LoOCV
 When using this exhaustive method, we take p points out of the total number of data points in the dataset (say n).
 While training the model we train it on the remaining (n − p) data points and test the model on the p held-out data points.
 We repeat this process for all possible combinations of p points from the original dataset.
 Then, to get the final accuracy, we average the accuracies from all these iterations.
 LOOCV (Leave-One-Out CV) is a simple variation of Leave-P-Out cross-validation in which p is set to one. This makes the method much less exhaustive, since for n data points and p = 1 there are only n combinations to evaluate.
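A minimal sketch of leave-one-out cross-validation with scikit-learn; the dataset and model are placeholders, and LeavePOut(p=2) could be substituted for the general leave-p-out case.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# One observation is held out per iteration, so the model is fit n times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("Number of fits:", len(scores))   # equals n
print("LOOCV accuracy:", scores.mean())
```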

Leave-P-Out cross validation / LoOCV
 LOOCV is the case of Cross-Validation where just a single
observation is held out for validation.
 The model is evaluated for every held out observation. The
final result is then calculated by taking the mean of all the
individual evaluations.
 There are two problems with LOOCV.
1. It can be computationally expensive to use LOOCV, particularly if the dataset is large or the model takes substantial time to train even once. This is because we fit the model n times, each time on almost the whole dataset.
2. The other problem with LOOCV is that it can be subject to high variance or overfitting, since we feed the model almost all the training data to learn from and just a single observation to evaluate on.
Bootstrapping
 A statistical concept, bootstrapping is a resampling method used to simulate samples out of a dataset by sampling with replacement. The process of bootstrapping allows one to infer properties of the population, derive standard errors, and ensure that data is tested efficiently.
 The Bootstrap Sampling Method is a very simple concept and is a
building block for some of the more advanced machine learning
algorithms like AdaBoost and GBoost.
 Technically speaking, the bootstrap sampling method is a resampling
method that uses random sampling with replacement.
 This technique involves repeatedly sampling a dataset with random
replacement. A statistical test that falls under the category of
resampling methods, this method ensures that the statistics
evaluated are accurate and unbiased as much as possible.

Bootstrapping
 Invented by Bradley Efron, the bootstrapping method is
known to generate new samples or resamples out of the
already existing samples in order to measure the accuracy of a
sample statistic.
 Using the replacement technique, the method creates new
hypothetical samples that help in the testing of an estimated
value.
 The observations that are chosen for the resample are referred to as the 'bootstrapped sample' (of the chosen bootstrap sample size). The observations that are not chosen are referred to as the 'out-of-bag' samples and serve as the testing dataset.

Bootstrapping
 Following are the steps involved in the bootstrapping method (a runnable sketch follows the list):
1. Randomly choose a sample size.
2. Pick an observation from the training dataset at random.
3. Add this observation to the sample chosen earlier.
4. Repeat steps 2 and 3 until the chosen sample size is reached; because observations are drawn with replacement, the same observation may appear more than once.
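A hedged sketch of bootstrap sampling with NumPy; the data values, the number of resamples, and the statistic (the mean) are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.array([2.3, 3.1, 4.8, 5.0, 6.2, 7.7, 8.4, 9.1])

# One bootstrap sample: draw len(data) observations with replacement.
idx = rng.choice(len(data), size=len(data), replace=True)
bootstrap_sample = data[idx]
out_of_bag = data[np.setdiff1d(np.arange(len(data)), idx)]  # observations never drawn

# Repeating the resampling many times gives a distribution of the statistic of interest.
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]
print("Bootstrap estimate of the mean:", np.mean(boot_means))
print("Bootstrap standard error of the mean:", np.std(boot_means))
```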

Support Vector Machine
 A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges.
 In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that differentiates the two classes best.
 The main goal of SVM is to divide the dataset into classes by finding a maximum marginal hyperplane (MMH), which can be done in the following two steps −
1. First, SVM generates hyperplanes iteratively that segregate the classes in the best way.
2. Then, it chooses the hyperplane that separates the classes correctly with the maximum margin.
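A short sketch of training an SVM classifier with scikit-learn's SVC; the iris dataset and the linear kernel are placeholder choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit a maximum-margin classifier; the support vectors define the decision boundary.
clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
print("Support vectors per class:", clf.n_support_)
print("Test accuracy:", clf.score(X_test, y_test))
```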
Support Vector Machine
 The following are important concepts in SVM −
 Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.
 Hyperplane − A decision plane or space that divides a set of objects belonging to different classes.
 Margin − It may be defined as the gap between the two lines drawn through the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.

Support Vector Machine
 The SVM algorithm is implemented with a kernel that transforms the input data space into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space.
 The following are some of the types of kernels used by SVM.
1. Linear kernel: It is the dot product between any two observations.
2. Polynomial kernel: It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces.
3. Radial Basis Function (RBF) kernel: The RBF kernel, mostly used in SVM classification, maps the input space into an infinite-dimensional space.
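As a hedged illustration, the sketch below compares these kernels on a nonlinear toy problem; the make_moons dataset is an arbitrary choice, and the accuracies will vary with the data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The same classifier with different kernels; RBF usually handles the curved boundary best here.
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "accuracy:", round(clf.score(X_test, y_test), 3))
```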

Support Vector Machine
 Pros:
◦ It works really well with a clear margin of separation
◦ It is effective in high dimensional spaces.
◦ It is effective in cases where the number of dimensions is
greater than the number of samples.
◦ It uses a subset of training points in the decision function
(called support vectors), so it is also memory efficient.
 Cons:
◦ It doesn't perform well when we have a large dataset, because the required training time is higher.
◦ It also doesn't perform very well when the dataset has more noise, i.e. when the target classes overlap.
◦ SVM doesn't directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (this is how the SVC class of the Python scikit-learn library provides them).
Confusion Matrix
 A confusion matrix is a table that is often used to describe
the performance of a classification model (or "classifier")
on a set of test data for which the true values are known.
 Confusion Matrix is a performance measurement for machine
learning classification.
 Classification accuracy alone can be misleading if you have an
unequal number of observations in each class or if you have
more than two classes in your dataset.
 Calculating a confusion matrix can give you a better idea of
what your classification model is getting right and what types
of errors it is making.

Confusion Matrix
 The following are the most basic terms; each is a whole number (a count), not a rate:
◦ True Positives (TP): Cases in which we predicted yes (they have the disease), and they do have the disease.
◦ True Negatives (TN): We predicted no, and they don't have the disease.
◦ False Positives (FP): We predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
◦ False Negatives (FN): We predicted no, but they actually do have the disease. (Also known as a "Type II error.")
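A small sketch of obtaining these four counts with scikit-learn's confusion_matrix; the label vectors below are made-up values used only for illustration.

```python
from sklearn.metrics import confusion_matrix

# 1 = has the disease, 0 = does not (hypothetical labels and predictions).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels [0, 1] the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```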

Confusion Matrix
 Let's understand TP, FP, FN and TN in terms of a pregnancy analogy.

Metrics derived from Confusion Matrix
 Accuracy: Accuracy is used to find the proportion of correctly classified values. It tells us how often our classifier is right. It is the sum of all true values divided by the total number of values.
 Precision: Precision is used to calculate the model's ability to classify positive values correctly. It is the true positives divided by the total number of predicted positive values.
 Recall / Sensitivity / True Positive Rate (TPR): It is used to calculate the model's ability to predict positive values. "How often does the model predict the correct positive values?" It is the true positives divided by the total number of actual positive values. Sensitivity tells us what proportion of the positive class got correctly classified.
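In terms of the confusion-matrix counts, these metrics are:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall (TPR) = TP / (TP + FN)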

Metrics derived from Confusion Matrix

 False Negative Rate (FNR): The FNR tells us what proportion of the positive class got incorrectly classified by the classifier.
 A higher TPR and a lower FNR are desirable, since we want to correctly classify the positive class.
 Specificity / True Negative Rate (TNR): Specificity tells us what proportion of the negative class got correctly classified.
 False Positive Rate (FPR): The FPR tells us what proportion of the negative class got incorrectly classified by the classifier.
 A higher TNR and a lower FPR are desirable, since we want to correctly classify the negative class.
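Expressed with the confusion-matrix counts:

FNR = FN / (TP + FN)
Specificity (TNR) = TN / (TN + FP)
FPR = FP / (TN + FP)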

Metrics derived from Confusion Matrix
 F1-Score: It is the harmonic mean of Recall and Precision. It is useful when you
need to take both Precision and Recall into account.

 Compute all metrics for the confusion matrix made for a classifier that classifies people based on whether they speak English or Spanish, given the counts below (a worked computation follows).
◦ True Positives (TP) = 86
◦ True Negatives (TN) = 79
◦ False Positives (FP) = 12
◦ False Negatives (FN) = 10
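A worked sketch of the computation for these counts (values rounded to three decimal places):

```python
TP, TN, FP, FN = 86, 79, 12, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)                # 165 / 187 ≈ 0.882
precision   = TP / (TP + FP)                                 # 86 / 98  ≈ 0.878
recall      = TP / (TP + FN)                                 # 86 / 96  ≈ 0.896
specificity = TN / (TN + FP)                                 # 79 / 91  ≈ 0.868
fpr         = FP / (TN + FP)                                 # 12 / 91  ≈ 0.132
fnr         = FN / (TP + FN)                                 # 10 / 96  ≈ 0.104
f1          = 2 * precision * recall / (precision + recall)  # ≈ 0.887

print(f"Accuracy={accuracy:.3f}, Precision={precision:.3f}, Recall={recall:.3f}")
print(f"Specificity={specificity:.3f}, FPR={fpr:.3f}, FNR={fnr:.3f}, F1={f1:.3f}")
```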

Confusion Matrix for Multi-Class Classification
 Consider a confusion matrix for a multiclass problem where we have to predict whether a person loves Facebook, Instagram or Snapchat. The confusion matrix would be a 3 x 3 matrix like this:
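Since the 3 x 3 matrix itself is not reproduced here, a small sketch of how such a matrix could be computed with scikit-learn (the labels and predictions are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = ["Facebook", "Instagram", "Snapchat", "Facebook", "Snapchat", "Instagram"]
y_pred = ["Facebook", "Snapchat", "Snapchat", "Facebook", "Instagram", "Instagram"]

labels = ["Facebook", "Instagram", "Snapchat"]
# Rows are the true classes, columns the predicted classes, in the order of `labels`.
print(confusion_matrix(y_true, y_pred, labels=labels))
```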

Metrics derived from Confusion Matrix

https://www.inabia.com/learning/quiz/confusion-matrix-quiz/

AUC-ROC Curve
 The Receiver Operator Characteristic (ROC) curve is an
evaluation metric for binary classification problems.
 It is a probability curve that plots the TPR against FPR at
various threshold values and essentially separates the
‘signal’ from the ‘noise’.
 The Area Under the Curve (AUC) is the measure of the
ability of a classifier to distinguish between classes and is used
as a summary of the ROC curve.
 An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has the worst measure of separability.
 The higher the AUC, the better the performance of the model
at distinguishing between the positive and negative classes.

AUC-ROC Curve
 The ROC curve is plotted with TPR against the FPR where TPR
is on the y-axis and FPR is on the x-axis.

 In a ROC curve, a higher x-axis value indicates a higher number of false positives relative to true negatives, while a higher y-axis value indicates a higher number of true positives relative to false negatives. So the choice of the threshold depends on the ability to balance between false positives and false negatives.
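A hedged sketch of computing the ROC curve and AUC with scikit-learn; the dataset and the logistic-regression classifier are placeholders chosen only so the example runs.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # TPR and FPR at each threshold
print("AUC:", roc_auc_score(y_test, scores))
```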
AUC-ROC Curve
 When AUC = 1, then the classifier is able to perfectly distinguish
between all the Positive and the Negative class points correctly.

 The red distribution curve is of the positive class (patients with the disease) and the green distribution curve is of the negative class (patients with no disease).

 When the two distribution curves do not overlap at all, the model has an ideal measure of separability: it is perfectly able to distinguish between the positive class and the negative class.
 If, however, the AUC had been 0, the classifier would be predicting all negatives as positives and all positives as negatives.
AUC-ROC Curve
 When 0.5<AUC<1, there is a high chance that the classifier will be
able to distinguish the positive class values from the negative class
values. This is so because the classifier is able to detect more
numbers of True positives and True negatives than False negatives and
False positives.

 When two distributions overlap, we introduce type 1 and type 2


errors. Depending upon the threshold, we can minimize or maximize
them. When AUC is 0.7, it means there is a 70% chance that the
model will be able to distinguish between positive class and negative
class.
AUC-ROC Curve
 When AUC = 0.5, the classifier is not able to distinguish between positive and negative class points, meaning it is predicting either a random class or a constant class for all the data points.

 This is the worst situation. When AUC is approximately 0.5, the


model has no discrimination capacity to distinguish between
positive class and negative class.
 So, the higher the AUC value for a classifier, the better its
ability to distinguish between positive and negative classes.
AUC-ROC Curve
 When AUC is approximately 0, the model is actually
reciprocating the classes. It means the model is predicting a
negative class as a positive class and vice versa.

 Sensitivity and Specificity are inversely proportional to each


other. So when we increase Sensitivity, Specificity decreases,
and vice versa.

Naïve Bayesian Classifier
 The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(xi | y).
 Gaussian Naive Bayes classifier : continuous values
associated with each feature are assumed to be distributed
according to a Gaussian distribution.
 Multinomial Naive Bayes: Feature vectors represent the
frequencies with which certain events have been generated by
a multinomial distribution. This is the event model typically
used for document classification.
 Bernoulli Naive Bayes: In the multivariate Bernoulli event
model, features are independent booleans (binary variables)
describing inputs.
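A brief sketch using scikit-learn's GaussianNB for continuous features; the iris dataset is a placeholder, and MultinomialNB or BernoulliNB could be swapped in for count or binary features.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each feature is modelled with a per-class Gaussian; prediction applies Bayes' theorem.
model = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```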

Advantages : Naïve Bayesian Classifier
 They require a small amount of training data to estimate the
necessary parameters.
 Naive Bayes learners and classifiers can be extremely fast
compared to more sophisticated methods.
 Naive Bayes has very low computation cost.
 It can efficiently work on a large dataset.
 It performs well in case of discrete response variable compared
to the continuous variable.
 It can be used with multiple class prediction problems.
 It also performs well in the case of text analytics problems.

Disadvantages : Naïve Bayesian Classifier
 The assumption of independent features. In practice, it is almost impossible that the model will get a set of predictors that are entirely independent.
 If there is no training tuple of a particular class, this causes a zero posterior probability. In this case, the model is unable to make a prediction. This problem is known as the Zero Probability/Frequency Problem (commonly mitigated with Laplace, i.e. add-one, smoothing).

Applications : Naïve Bayesian Classifier
 Real-time prediction: Naive Bayes is an eager learning classifier and it is very fast, so it can be used for making predictions in real time.
 Multi-class prediction: This algorithm is also well known for its multi-class prediction capability; we can predict the probability of multiple classes of the target variable.
 Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers, widely used in text classification (due to better results in multi-class problems and the independence assumption), have a higher success rate compared to many other algorithms.
 Recommendation systems: A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
Discriminative Learning
 A discriminative model makes predictions on unseen data based on conditional probability and can be used for either classification or regression problem statements.
 Generative models are useful for unsupervised learning
 The discriminative model is used particularly for supervised
machine learning. Also called a conditional model, it learns the
boundaries between classes or labels in a dataset.
 Examples of Discriminative Models
• Logistic regression
• Support Vector Machines (SVMs)
• Traditional neural networks
• Nearest neighbor
• Conditional Random Fields (CRFs)
• Decision Trees and Random Forest
Generative Learning
 Generative model focuses on the distribution of a dataset to
return a probability for a given example.
 Generative models are useful for unsupervised learning
 Since these types of models often rely on Bayes' theorem to find the joint probability, generative models can tackle more complex tasks than analogous discriminative models.
 These models are used in unsupervised machine learning as a
means to perform tasks such as
• Probability and Likelihood estimation,
• Modeling data points,
• To describe the phenomenon in data,
• To distinguish between classes based on these probabilities

Generative Learning
 Some Examples of Generative Models
◦ Naïve Bayes
◦ Bayesian networks
◦ Markov random fields
◦ Hidden Markov Models (HMMs)
◦ Latent Dirichlet Allocation (LDA)
◦ Generative Adversarial Networks (GANs)
◦ Autoregressive Model
 Major drawback – If there is a presence of outliers in
the dataset, then it affects these types of models to a
significant extent.

Generative & Discriminative Learning
 Performance
 Generative models need fewer data to train compared with
discriminative models since generative models are more biased as
they make stronger assumptions i.e, assumption of conditional
independence.
 Based on Missing Data
 Generative models can work with missing data, while discriminative models cannot. This is because, with a generative model, we can still estimate the posterior by marginalizing over the unseen variables. However, for discriminative models, we usually require all the features X to be observed.
 Based on Accuracy Score
 If the assumption of conditional independence is violated, then generative models are less accurate than discriminative models.
