Probabilistic Models for Classification
Binary Classification Problem
• N iid training samples: $\{(\mathbf{x}_n, t_n)\}_{n=1}^{N}$
• Class label: $t_n \in \{0, 1\}$
• Feature vector: $\mathbf{x}_n \in \mathbb{R}^D$
Generative models for classification
• Model the joint probability $p(\mathbf{x}, t) = p(t)\, p(\mathbf{x} \mid t)$
Generative Process for Data
• Enables generation of new data points (see the sketch below)
• Repeat N times:
• Sample class $t_n \sim p(t)$
• Sample feature value $\mathbf{x}_n \sim p(\mathbf{x} \mid t_n)$
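A minimal sketch of this ancestral sampling in Python, with hypothetical parameters (an invented prior pi and 2-D Gaussian class conditionals with a shared covariance, matching the case studied below):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: prior p(t=1) and Gaussian class conditionals
pi = 0.4
mu = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 1.0])}
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])

def sample(N):
    # Repeat N times: sample the class, then the features given the class
    t = rng.binomial(1, pi, size=N)
    X = np.stack([rng.multivariate_normal(mu[tn], Sigma) for tn in t])
    return X, t

X, t = sample(1000)   # 1000 fresh (x, t) pairs drawn from the model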
Conditional Probability in a Generative Model
• $p(t = 1 \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid t=1)\, p(t=1)}{p(\mathbf{x} \mid t=1)\, p(t=1) + p(\mathbf{x} \mid t=0)\, p(t=0)} = \sigma(a)$,
where $a = \ln \dfrac{p(\mathbf{x} \mid t=1)\, p(t=1)}{p(\mathbf{x} \mid t=0)\, p(t=0)}$ is the log-odds
• Logistic function: $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$
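A direct transcription of this identity (the function names and the log-joint arguments are mine, not from the slides):

import numpy as np

def sigmoid(a):
    # Logistic function: 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def posterior_t1(log_joint1, log_joint0):
    # a is the log-odds: ln p(x, t=1) - ln p(x, t=0)
    return sigmoid(log_joint1 - log_joint0)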
Case: Binary classification with Gaussians
• Parameters: prior $p(t = 1) = \pi$; class conditionals $p(\mathbf{x} \mid t = k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma})$ for $k \in \{0, 1\}$
• Note: the covariance parameter $\boldsymbol{\Sigma}$ is shared between the two classes
• $p(t = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0)$,
where $\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$ and $w_0 = -\tfrac{1}{2} \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \tfrac{1}{2} \boldsymbol{\mu}_0^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_0 + \ln \dfrac{\pi}{1 - \pi}$
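A sketch of these formulas in Python; gda_posterior and its argument names are mine:

import numpy as np

def gda_posterior(x, mu0, mu1, Sigma, pi):
    # p(t=1|x) = sigma(w^T x + w0) for shared-covariance Gaussians
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)
    w0 = (-0.5 * mu1 @ Sinv @ mu1
          + 0.5 * mu0 @ Sinv @ mu0
          + np.log(pi / (1.0 - pi)))
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))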
Special Cases
• Isotropic shared covariance, $\boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$
• Class boundary: a hyper-plane orthogonal to the line joining the two means
• Arbitrary shared $\boldsymbol{\Sigma}$
• Decision boundary still linear, but not orthogonal to the line joining the two means
MLE for Binary Gaussian
• Formulate the log-likelihood in terms of the parameters $\pi, \boldsymbol{\mu}_0, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}$: $\ln p(\mathbf{t}, \mathbf{X}) = \sum_{n=1}^{N} \big[ t_n \ln \big( \pi\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) \big) + (1 - t_n) \ln \big( (1 - \pi)\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_0, \boldsymbol{\Sigma}) \big) \big]$
• Maximizing gives closed-form estimates (see the sketch below)
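A sketch of the resulting estimators; the helper name is mine, the formulas are the standard closed-form MLE solutions:

import numpy as np

def gda_mle(X, t):
    # X: (N, D) features, t: (N,) integer labels in {0, 1}
    N = len(t)
    N1 = t.sum()
    pi = N1 / N                       # MLE of the prior
    mu1 = X[t == 1].mean(axis=0)      # MLE of the class means
    mu0 = X[t == 0].mean(axis=0)
    d1 = X[t == 1] - mu1              # shared covariance: pooled scatter
    d0 = X[t == 0] - mu0
    Sigma = (d1.T @ d1 + d0.T @ d0) / N
    return pi, mu0, mu1, Sigma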
Case: Gaussian Multi-class Classification
• $K$ classes: $t_n \in \{1, \ldots, K\}$
• Prior: $p(t = k) = \pi_k$
• Class conditional densities: $p(\mathbf{x} \mid t = k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma})$
• Posterior: $p(t = k \mid \mathbf{x}) = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$, where $a_k = \ln \big( p(\mathbf{x} \mid t = k)\, p(t = k) \big)$
• Soft-max / normalized exponential function
• For Gaussian class conditionals, $a_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}$, with $\mathbf{w}_k = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k$ and $w_{k0} = -\tfrac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \ln \pi_k$
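A sketch of the soft-max posterior for the Gaussian case; function and argument names are mine:

import numpy as np

def softmax(a):
    # Normalized exponential, shifted for numerical stability
    e = np.exp(a - a.max())
    return e / e.sum()

def multiclass_posterior(x, mus, Sigma, pis):
    # a_k = w_k^T x + w_k0 with w_k = Sigma^{-1} mu_k
    Sinv = np.linalg.inv(Sigma)
    a = np.array([mu @ Sinv @ x - 0.5 * mu @ Sinv @ mu + np.log(p)
                  for mu, p in zip(mus, pis)])
    return softmax(a)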
MLE for Gaussian Multi-class
• Similar to the binary case: $\hat{\pi}_k = N_k / N$, $\hat{\boldsymbol{\mu}}_k = \frac{1}{N_k} \sum_{n : t_n = k} \mathbf{x}_n$, and the shared $\hat{\boldsymbol{\Sigma}}$ is the $N_k$-weighted average of the per-class covariances
Case: Naïve Bayes
• Similar to the Gaussian setting, except that the features are discrete (binary, for simplicity): $x_i \in \{0, 1\}$
• Naïve Bayes assumption: the features are conditionally independent given the class
• Class conditional probability: $p(\mathbf{x} \mid t = k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$, with $\mu_{ki} = p(x_i = 1 \mid t = k)$
• Posterior probability: $p(t = k \mid \mathbf{x}) = \dfrac{\exp(a_k(\mathbf{x}))}{\sum_j \exp(a_j(\mathbf{x}))}$,
where $a_k(\mathbf{x}) = \sum_{i=1}^{D} \big[ x_i \ln \mu_{ki} + (1 - x_i) \ln (1 - \mu_{ki}) \big] + \ln \pi_k$ is again linear in $\mathbf{x}$
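A sketch of this posterior computation; mu[k, i] stands for $\mu_{ki}$ and all names are mine:

import numpy as np

def nb_posterior(x, mu, pis):
    # x: (D,) binary features, mu: (K, D) Bernoulli parameters, pis: (K,) priors
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(pis)
    e = np.exp(a - a.max())           # soft-max, shifted for stability
    return e / e.sum()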
MLE for Naïve Bayes
• Formulate the log-likelihood in terms of the parameters; maximizing gives count-based estimates, $\hat{\mu}_{ki} = N_{ki} / N_k$
• MLE overfits
• Susceptible to 0 frequencies in the training data: if feature $i$ never fires for class $k$ in training, then $\hat{\mu}_{ki} = 0$ and any test point with $x_i = 1$ gets posterior probability 0 for class $k$
Bayesian Estimation for Naïve Bayes
• Model the parameters as random variables and analyze their posterior distributions
• Take point estimates if necessary (see the sketch below)
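A sketch, assuming independent Beta(a, b) priors on each $\mu_{ki}$ (a = b = 1 gives add-one / Laplace smoothing); the posterior-mean point estimate can never be exactly 0, which fixes the zero-frequency problem above:

import numpy as np

def nb_fit_bayes(X, t, K, a=1.0, b=1.0):
    # Posterior-mean estimates: (N_ki + a) / (N_k + a + b)
    N, D = X.shape
    mu = np.empty((K, D))
    pis = np.empty(K)
    for k in range(K):
        Xk = X[t == k]
        mu[k] = (Xk.sum(axis=0) + a) / (len(Xk) + a + b)
        pis[k] = (len(Xk) + 1.0) / (N + K)   # uniform Dirichlet smoothing
    return mu, pis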
Discriminative Models for Classification
• Familiar form for the posterior class distribution: model $p(t \mid \mathbf{x})$ directly, without a generative model of $\mathbf{x}$
Logistic Regression for Binary Classification
• $p(t = 1 \mid \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x})$,
where $\sigma(\cdot)$ is the logistic function defined earlier
MLE for Binary Logistic Regression
• Maximize the likelihood w.r.t. the weights $\mathbf{w}$; equivalently, minimize the cross-entropy error $E(\mathbf{w}) = -\sum_{n=1}^{N} \big[ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \big]$, where $y_n = \sigma(\mathbf{w}^T \mathbf{x}_n)$
• Not quadratic, but still convex
• No closed-form solution; iterative optimization using gradient descent (updates have the same form as the LMS algorithm), with gradient $\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n)\, \mathbf{x}_n$
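A sketch of the gradient descent loop using the gradient above; the fixed step size lr is an arbitrary choice (Newton / IRLS steps converge faster):

import numpy as np

def logreg_fit(X, t, lr=0.1, iters=1000):
    # X: (N, D) design matrix (include a column of 1s for a bias), t in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y = 1.0 / (1.0 + np.exp(-(X @ w)))   # y_n = sigma(w^T x_n)
        grad = X.T @ (y - t)                 # sum_n (y_n - t_n) x_n
        w -= lr * grad / len(t)              # averaged step for stability
    return w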
Bayesian Binary Logistic Regression
• Bayesian model exists, but intractable
• Conjugacy breaks down because of the sigmoid function
• Laplace approximation for the posterior
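• As a reminder (the standard Laplace form, not reproduced from these slides): fit a Gaussian at the posterior mode, $q(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{MAP}, \mathbf{S}_N)$ with $\mathbf{S}_N^{-1} = -\nabla \nabla \ln p(\mathbf{w} \mid \mathcal{D}) \big|_{\mathbf{w} = \mathbf{w}_{MAP}}$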
Soft-max regression for Multi-class Classification
• Left as exercise
Choices for the activation function
• Probit function: CDF of the standard Gaussian, $\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta \mid 0, 1)\, d\theta$
• Complementary log-log model: CDF of the exponential distribution, giving $p = 1 - \exp(-e^{a})$
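A small numerical comparison of the three activation functions (names are mine; probit uses the identity $\Phi(a) = \tfrac{1}{2}(1 + \mathrm{erf}(a/\sqrt{2}))$):

import numpy as np
from scipy.special import erf

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def probit(a):
    # Standard Gaussian CDF
    return 0.5 * (1.0 + erf(a / np.sqrt(2.0)))

def cloglog(a):
    # Complementary log-log: asymmetric, approaches 1 much faster than 0
    return 1.0 - np.exp(-np.exp(a))

a = np.linspace(-3.0, 3.0, 7)
for f in (logistic, probit, cloglog):
    print(f.__name__, np.round(f(a), 3))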
Generative vs Discriminative: Summary
• Generative models
• Easy parameter estimation
• Require more parameters OR simplifying assumptions
• Models and “understands” each class
• Easy to accommodate unlabeled data
• Poorly calibrated probabilities
• Discriminative models
• Complicated estimation problem
• Fewer parameters and fewer assumptions
• No understanding of individual classes
• Difficult to accommodate unlabeled data
• Better calibrated probabilities
Decision Theory
• From posterior distributions to actions
• Loss functions measure extent of error
• Optimal action depends on loss function
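A minimal sketch of "optimal action depends on the loss function", with an invented asymmetric loss matrix L[a, t] (rows = actions, columns = true classes):

import numpy as np

# Hypothetical costs: a false negative is 10x worse than a false positive
L = np.array([[0.0, 10.0],
              [1.0,  0.0]])

def best_action(posterior):
    # Pick the action minimizing expected loss under p(t|x)
    return int(np.argmin(L @ posterior))

print(best_action(np.array([0.85, 0.15])))   # -> 1, despite p(t=0|x) > p(t=1|x)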
Loss functions
• 0-1 loss: $L(a, t) = \mathbb{I}[a \neq t]$; expected loss is minimized by predicting the most probable class
• False positive loss
• False negative loss
• Consider the class conditional distributions of the classifier score
• Decision rule: predict class 1 when $p(t = 1 \mid \mathbf{x}) > \theta$, with the threshold $\theta$ set by the relative losses
• Confusion matrix (counts at a given threshold): rows = predicted class, columns = true class, with entries TP, FP, FN, TN
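A sketch of the confusion counts at a given threshold (names are mine):

import numpy as np

def confusion_matrix(scores, t, theta):
    # Decision rule: predict 1 iff the score exceeds the threshold
    pred = (scores > theta).astype(int)
    tp = int(np.sum((pred == 1) & (t == 1)))
    fp = int(np.sum((pred == 1) & (t == 0)))
    fn = int(np.sum((pred == 0) & (t == 1)))
    tn = int(np.sum((pred == 0) & (t == 0)))
    return tp, fp, fn, tn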
ROC curves
[figure: ROC curve — true positive rate (TPR) vs false positive rate (FPR) as the decision threshold varies]
• Quality of classifier measured by area under the curve (AUC)
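A sketch of tracing the ROC curve by sweeping the threshold down over the sorted scores (ties between scores are ignored for simplicity):

import numpy as np

def roc_points(scores, t):
    # Lower the threshold one sample at a time, from highest score down
    order = np.argsort(-scores)
    t = t[order]
    tpr = np.cumsum(t) / t.sum()             # true positive rate
    fpr = np.cumsum(1 - t) / (1 - t).sum()   # false positive rate
    return fpr, tpr

def auc(fpr, tpr):
    # Trapezoidal area under the curve
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))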
Precision-recall curves
• In settings such as information retrieval:
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• Plot precision vs recall for varying values of the threshold
• Quality of classifier measured by area under the curve (AUC) or by specific values, e.g. P@k
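The same threshold sweep yields the precision-recall curve; note that the precision at cutoff k is exactly P@k:

import numpy as np

def pr_points(scores, t):
    order = np.argsort(-scores)
    tp = np.cumsum(t[order])              # true positives among the top k
    k = np.arange(1, len(t) + 1)
    precision = tp / k                    # P@k for every cutoff k
    recall = tp / t.sum()
    return recall, precision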
F1-scores
• To evaluate at a single threshold, need to combine precision and recall
• Harmonic mean: $F_1 = \dfrac{2\, P\, R}{P + R}$
• Why? The harmonic mean is dominated by the smaller of the two values, so a classifier cannot score well by maximizing one at the expense of the other: $P = 1.0$, $R = 0.01$ has arithmetic mean $\approx 0.5$ but $F_1 \approx 0.02$
Estimating generalization error
• Training set performance is not a good indicator of generalization error
• A more complex model overfits, a less complex one underfits
• Which model do I select?
• Validation set
• Typically an 80% / 20% train/validation split
• Wastes valuable labeled data
• Cross validation (see the sketch below)
• Split the training data into K folds
• For the i-th iteration, train on the other K − 1 folds, test on the i-th fold
• Average the generalization error over all K folds
• Leave-one-out cross validation: K = N
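A sketch of K-fold cross validation; fit and error are hypothetical callbacks standing in for any of the models above:

import numpy as np

def kfold_error(X, t, K, fit, error, seed=0):
    idx = np.random.default_rng(seed).permutation(len(t))
    folds = np.array_split(idx, K)        # K roughly equal folds
    errs = []
    for i in range(K):
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        model = fit(X[train], t[train])                      # train on K-1 folds
        errs.append(error(model, X[folds[i]], t[folds[i]]))  # test on fold i
    return float(np.mean(errs))           # average generalization error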
Summary
• Generative models
• Gaussian Discriminant Analysis
• Naïve Bayes
• Discriminative models
• Logistic regression
• Iterative algorithms for training
• Binary vs Multiclass
• Generalization performance
• Cross validation