Unit 7 - 2
• P(Y, X) = P(Y, x1, x2, …, xn)
Discriminative and Generative Models
• Now, our goal is to estimate the probability that an email is spam, i.e., P(Y=1|X).
• Both generative and discriminative models can solve this problem but in
different ways.
• The approach of Generative Models
• In the case of generative models, to find the conditional probability P(Y|X), they estimate the prior probability P(Y) and the likelihood P(X|Y) from the training data and use Bayes' theorem to calculate the posterior probability P(Y|X):
• P(Y|X) = P(X|Y) · P(Y) / P(X)
Discriminative and Generative Models
• In the case of discriminative models, to find the probability, they directly
assume some functional form for P(Y|X) and then estimate the parameters
of P(Y|X) with the help of the training data.
• The discriminative model refers to a class of models used in Statistical
Classification, mainly used for supervised machine learning.
• These types of models are also known as conditional models since they
learn the boundaries between classes or labels in a dataset.
Discriminative and Generative Models
• Discriminative models (as the literal meaning suggests) learn the boundary that separates the classes rather than modeling how the data itself is distributed, and they make few assumptions about the data points.
• However, these models are not capable of generating new data points. The ultimate objective of discriminative models is therefore to separate one class from another.
• If some outliers are present in the dataset, discriminative models work better than generative models, i.e., discriminative models are more robust to outliers. However, one major drawback of these models is the misclassification problem, i.e., wrongly classifying a data point.
Discriminative and Generative Models
Generative Models
• Generative models are a class of statistical models that can generate new data instances. They are often used in unsupervised machine learning for tasks such as estimating probabilities and likelihoods, modeling data points, and distinguishing between classes.
• A generative classifier works as follows:
• Assume some functional form for the probabilities P(Y) and P(X|Y)
• With the help of the training data, estimate the parameters of P(X|Y) and P(Y)
• Use Bayes' theorem to calculate the posterior probability P(Y|X)
• Some Examples of Generative Models
• Naïve Bayes
• Generative Adversarial Networks (GANs)
• Hidden Markov Models (HMMs)
Difference between Discriminative and
Generative Models
• Discriminative models draw boundaries in the data space, while
generative models try to model how data is placed throughout the space.
• A generative model focuses on explaining how the data was generated,
while a discriminative model focuses on predicting the labels of the data.
• In mathematical terms, a discriminative model is trained by learning parameters that maximize the conditional probability P(Y|X), while a generative model learns its parameters by maximizing the joint probability P(X, Y).
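• As a quick illustration (not from the slides), the following Python sketch fits both kinds of model on the same data: LogisticRegression is a discriminative model that learns P(Y|X) directly, while GaussianNB is a generative model that estimates P(Y) and P(X|Y) and applies Bayes' theorem. The dataset and settings are illustrative only.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Illustrative synthetic data (not the spam example from the slides).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

discriminative = LogisticRegression().fit(X, y)   # learns P(Y|X) directly
generative = GaussianNB().fit(X, y)               # learns P(Y) and P(X|Y)

# Both can report a posterior P(Y|X); they just arrive at it differently.
print(discriminative.predict_proba(X[:1]))
print(generative.predict_proba(X[:1]))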
Difference between Discriminative and
Generative Models
• Discriminative models work with existing data, i.e., discriminative modeling identifies tags and sorts data and can be used to classify data, while generative modeling produces new data.
• Outliers have a larger impact on generative models than on discriminative models.
• Discriminative models are computationally cheaper than generative models.
Naïve Bayes Algorithm
• The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
• It is mainly used in text classification, which involves high-dimensional training datasets.
• The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
Naïve Bayes Algorithm
• Bayes Theorem
• Bayes’ Theorem is a simple mathematical formula used for calculating
conditional probabilities.
• Conditional probability is a measure of the probability of an event
occurring given that another event has (by assumption, presumption,
assertion, or evidence) occurred.
Naïve Bayes Algorithm
Naïve Bayes Algorithm
• Bayes' Theorem: P(A|B) = P(B|A) · P(A) / P(B)
• It tells us how often A happens given that B happens, written P(A|B) and also called the posterior probability, when we know: how often B happens given that A happens, written P(B|A); how likely A is on its own, written P(A); and how likely B is on its own, written P(B).
• In simpler terms, Bayes' Theorem is a way of finding a probability when we know certain other probabilities.
Naïve Bayes Algorithm
• The fundamental Naïve Bayes assumption is that each feature makes an:
• independent
• equal
contribution to the outcome.
• Let us take an example to get some better intuition. Consider the car theft problem with attributes Color, Type, and Origin, and the target Stolen, which can be either Yes or No.
Naïve Bayes Algorithm
Naïve Bayes Algorithm
• Concerning our dataset, the concept of assumptions made by the
algorithm can be understood as:
• We assume that no pair of features are dependent. For example, the color
being ‘Red’ has nothing to do with the Type or the Origin of the car.
Hence, the features are assumed to be Independent.
• Secondly, each feature is given the same influence (or importance). For example, knowing only the Color and Type alone can't predict the outcome perfectly. So none of the attributes is irrelevant, and all are assumed to contribute equally to the outcome.
Naïve Bayes Algorithm
• The assumptions made by Naïve Bayes are generally not correct in real-world situations. The independence assumption is never correct but often works well in practice. Hence the name 'Naïve'.
• Here in our dataset, we need to classify whether the car is stolen, given
the features of the car. The columns represent these features and the rows
represent individual entries. If we take the first row of the dataset, we can observe that the car is stolen if the Color is Red, the Type is Sports, and the Origin is Domestic. So we want to classify whether a Red Domestic SUV will get stolen or not. Note that there is no example of a Red Domestic SUV in our dataset.
Naïve Bayes Algorithm
• Now, you can obtain the values for each by looking at the dataset and
substitute them into the equation.
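• The equation referred to here is the Naïve Bayes rule for features X = (x1, x2, …, xn); under the independence assumption it reduces to P(y | x1, …, xn) ∝ P(y) · P(x1 | y) · P(x2 | y) · … · P(xn | y), and we predict the class y with the largest value.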
Naïve Bayes Algorithm
• The posterior probability P(y|X) can be calculated by first creating a Frequency Table for each attribute against the target.
• Then, the Frequency Tables are converted into Likelihood Tables, and finally the Naïve Bayes equation is used to calculate the posterior probability for each class.
• The class with the highest posterior probability is the outcome of the prediction.
• Below are the Frequency and likelihood tables for all three predictors.
Naïve Bayes Algorithm
• Since 0.144 > 0.048, given the features Red, SUV, and Domestic, our example gets classified as 'No', i.e., the car is not stolen.
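• A minimal Python sketch of this computation is shown below. The 10-row car-theft table is the one this example is usually stated with; it is reproduced here as an assumption (the slide's own table is not included in this text), and with these rows it yields exactly the 0.144 vs 0.048 comparison above.

# By-hand Naive Bayes for the car-theft example.
data = [
    # (Color, Type, Origin, Stolen)
    ("Red",    "Sports", "Domestic", "Yes"),
    ("Red",    "Sports", "Domestic", "No"),
    ("Red",    "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV",    "Imported", "No"),
    ("Yellow", "SUV",    "Imported", "Yes"),
    ("Yellow", "SUV",    "Domestic", "No"),
    ("Red",    "SUV",    "Imported", "No"),
    ("Red",    "Sports", "Imported", "Yes"),
]

def likelihood(feature_index, value, label):
    # P(feature = value | Stolen = label), estimated from frequency counts.
    rows = [r for r in data if r[3] == label]
    return sum(r[feature_index] == value for r in rows) / len(rows)

query = ("Red", "SUV", "Domestic")          # the Red Domestic SUV
for label in ("Yes", "No"):
    score = 1.0
    for i, value in enumerate(query):
        score *= likelihood(i, value, label)
    # The priors P(Yes) = P(No) = 0.5 are equal, so they do not change the ranking.
    print(label, round(score, 3))           # Yes -> 0.048, No -> 0.144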
Support Vector Machines
• A Support Vector Machine (SVM) is a powerful and versatile Machine
Learning model, capable of performing linear or nonlinear classification,
regression, and even outlier detection.
• SVMs are particularly well suited for classification of complex small- or medium-sized datasets.
• Linear SVM Classification
Linear SVM Classification
• The two classes can clearly be separated easily with a straight line (they are
linearly separable). The left plot shows the decision boundaries of three possible
linear classifiers. The model whose decision boundary is represented by the
dashed line is so bad that it does not even separate the classes properly. The other
two models work perfectly on this training set, but their decision boundaries
come so close to the instances that these models will probably not perform as
well on new instances.
• In contrast, the solid line in the plot on the right represents the decision boundary
of an SVM classifier; this line not only separates the two classes but also stays as
far away from the closest training instances as possible. You can think of an SVM
classifier as fitting the widest possible street (represented by the parallel dashed
lines) between the classes. This is called large margin classification.
Linear SVM Classification
• Notice that adding more training instances “off the street” will not affect
the decision boundary at all: it is fully determined (or “supported”) by the
instances located on the edge of the street. These instances are called the
support vectors (they are circled in Figure 5-1).
Linear SVM Classification
• SVMs are sensitive to the feature scales, as you can see in Figure 5-2: in
the left plot, the vertical scale is much larger than the horizontal scale, so
the widest possible street is close to horizontal.
• After feature scaling (e.g., using Scikit-Learn’s StandardScaler), the
decision boundary in the right plot looks much better.
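• A minimal scikit-learn sketch of this preprocessing step (the variable X stands for whatever training features you are using):

from sklearn.preprocessing import StandardScaler

# Rescale every feature to zero mean and unit variance so that no single
# feature dominates the width of the street.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)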
Soft Margin Classification
• If we strictly impose that all instances must be off the street and on the
right side, this is called hard margin classification. There are two main
issues with hard margin classification. First, it only works if the data is
linearly separable. Second, it is sensitive to outliers. Figure 5-3 shows the
iris dataset with just one additional outlier: on the left, it is impossible to
find a hard margin; on the right, the decision boundary ends up very
different from the one we saw in Figure 5-1 without the outlier, and it will
probably not generalize as well.
Soft Margin Classification
Soft Margin Classification
• To avoid these issues, use a more flexible model. The objective is to find
a good balance between keeping the street as large as possible and
limiting the margin violations (i.e., instances that end up in the middle of
the street or even on the wrong side).
• This is called soft margin classification
Soft Margin Classification
• When creating an SVM model using Scikit-Learn, we can specify a
number of hyperparameters. C is one of those hyperparameters. If we set
it to a low value, then we end up with the model on the left of Figure 5-4.
• With a high value, we get the model on the right. Margin violations are
bad. It’s usually better to have few of them. However, in this case the
model on the left has a lot of margin violations but will probably
generalize better.
• If your SVM model is overfitting, you can try regularizing it by reducing
C.
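• A minimal sketch of soft margin classification in scikit-learn, following the iris setup the figures describe (petal length and width, Iris virginica vs. the rest); C=1 is a fairly low value, so expect a wide street with some margin violations:

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                       # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)      # Iris virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),                 # SVMs are sensitive to scale
    ("linear_svc", LinearSVC(C=1, loss="hinge")), # lower C -> wider street
])
svm_clf.fit(X, y)
print(svm_clf.predict([[5.5, 1.7]]))              # e.g. [1.]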
Soft Margin Classification
Nonlinear SVM Classification
• Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close to being linearly separable. One approach to handling nonlinear datasets is to add more features, such as polynomial features; in some cases this can result in a linearly separable dataset.
• Consider the left plot in Figure 5-5: it represents a simple dataset with just
one feature, x1. This dataset is not linearly separable, as you can see. But
if you add a second feature x2 = (x1)^2, the resulting 2D dataset is perfectly linearly separable.
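• One way to implement this idea is a Pipeline that adds polynomial features, scales them, and then fits a linear SVM. The sketch below uses the moons dataset and illustrative hyperparameter values, which are assumptions, not taken from the slides:

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),  # add squared, cubed, cross terms
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge", max_iter=10000)),
])
polynomial_svm_clf.fit(X, y)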
Nonlinear SVM Classification
Nonlinear SVM Classification
Polynomial Kernel
• Adding polynomial features is simple to implement and can work great with
all sorts of Machine Learning algorithms (not just SVMs). That said, at a
low polynomial degree, this method cannot deal with very complex datasets,
and with a high polynomial degree it creates a huge number of features,
making the model too slow.
• Fortunately, when using SVMs you can apply an almost miraculous
mathematical technique called the kernel trick (explained in a moment). The
kernel trick makes it possible to get the same result as if you had added
many polynomial features, even with very high-degree polynomials, without
actually having to add them. So there is no combinatorial explosion of the
number of features because you don’t actually add any features.
Polynomial Kernel
• An SVM classifier using a third-degree polynomial kernel is shown on the left in Figure 5-7. On the right is another SVM classifier using a 10th-degree polynomial kernel. Obviously, if your model is overfitting, you might want to reduce the polynomial degree. Conversely, if it is underfitting, you can try increasing it.
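• A sketch of the third-degree polynomial kernel classifier described here (hyperparameter values are illustrative; X and y could be, for example, the moons dataset from the earlier sketch):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    # Kernel trick: behaves as if degree-3 polynomial features had been added,
    # without actually creating them; coef0 balances high- vs low-degree terms.
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)   # X, y: any two-class training set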
Similarity Features
• Another technique to tackle nonlinear problems is to add features
computed using a similarity function, which measures how much each
instance resembles a particular landmark. For example, let’s take the 1D
dataset discussed earlier and add two landmarks to it at x1 = –2 and x1 =
1 (see the left plot in Figure 5-8). Next, let’s define the similarity function
to be the Gaussian Radial Basis Function (RBF) with γ = 0.3 (see
Equation 5-1).
Similarity Features
Similarity Features
• This is a bell-shaped function varying from 0 (very far away from the
landmark) to 1 (at the landmark). Now we are ready to compute the new
features. For example, let’s look at the instance x1 = –1: it is located at a
distance of 1 from the first landmark and 2 from the second landmark.
• Therefore its new features are x2 = exp(–0.3 × 1^2) ≈ 0.74 and x3 = exp(–0.3 × 2^2) ≈ 0.30. The plot on the right in Figure 5-8 shows the
transformed dataset (dropping the original features). As you can see, it is
now linearly separable.
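• A few lines of Python reproduce this worked example (gamma = 0.3, landmarks at –2 and 1, instance x1 = –1):

import numpy as np

def rbf_similarity(x, landmark, gamma=0.3):
    # Gaussian RBF from Equation 5-1: exp(-gamma * ||x - landmark||^2)
    return np.exp(-gamma * (x - landmark) ** 2)

x = -1.0
print(rbf_similarity(x, -2.0))   # x2 ≈ 0.74
print(rbf_similarity(x,  1.0))   # x3 ≈ 0.30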
Similarity Features
• You may wonder how to select the landmarks. The simplest approach is
to create a landmark at the location of each and every instance in the
dataset. Doing that creates many dimensions and thus increases the
chances that the transformed training set will be linearly separable.
• The downside is that a training set with m instances and n features gets
transformed into a training set with m instances and m features (assuming
you drop the original features).
• If your training set is very large, you end up with an equally large number
of features
Similarity Features
• The other plots show models trained with different values of
hyperparameters gamma (γ) and C. Increasing gamma makes the bell-
shaped curve narrower (see the lefthand plots in Figure 5-8). As a result,
each instance’s range of influence is smaller: the decision boundary ends
up being more irregular, wiggling around individual instances.
Conversely, a small gamma value makes the bell-shaped curve wider:
instances have a larger range of influence, and the decision boundary ends
up smoother. So γ acts like a regularization hyperparameter: if your
model is overfitting, you should reduce it; if it is underfitting, you should
increase it (similar to the C hyperparameter).
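• In scikit-learn this is done with the SVC class and the Gaussian RBF kernel; a sketch with illustrative gamma and C values (X and y as in the earlier nonlinear example):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    # Larger gamma -> narrower bells -> more irregular boundary (risk of overfitting);
    # smaller C -> more regularization.
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)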
Similarity Features
Decision Function and Predictions
• The linear SVM classifier model predicts the class of a new instance x by
simply computing the decision function w⊺ x + b = w1 x1 + ⋯ + wn xn +
b. If the result is positive, the predicted class ŷ is the positive class (1), and otherwise it is the negative class (0); see Equation 5-2.
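• A literal reading of this decision rule in Python, assuming an already trained weight vector w and bias b (the numbers below are placeholders, not the values from the figures):

import numpy as np

def predict(x, w, b):
    # Positive class if w^T x + b >= 0, negative class otherwise.
    return 1 if np.dot(w, x) + b >= 0 else 0

w = np.array([1.3, 2.1])   # hypothetical weights for two features
b = -4.0                   # hypothetical bias term
print(predict(np.array([5.5, 1.7]), w, b))   # -> 1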
Decision Function and Predictions
• Figure 5-12 shows the decision function that corresponds to the model on the left in Figure 5-4: it is a 2D plane because this dataset has two
features (petal width and petal length). The decision boundary is the set of
points where the decision function is equal to 0: it is the intersection of
two planes, which is a straight line (represented by the thick solid line).
Decision Function and Predictions
Decision Function and Predictions
• The dashed lines represent the points where the decision function is equal
to 1 or –1: they are parallel and at equal distance to the decision
boundary, and they form a margin around it. Training a linear SVM
classifier means finding the values of w and b that make this margin as
wide as possible while avoiding margin violations (hard margin) or
limiting them (soft margin).
Training Objective
• Consider the slope of the decision function: it is equal to the norm of the
weight vector, ∥ w ∥. If we divide this slope by 2, the points where the
decision function is equal to ±1 are going to be twice as far away from the
decision boundary. In other words, dividing the slope by 2 will multiply
the margin by 2. This may be easier to visualize in 2D, as shown in Figure
5-13. The smaller the weight vector w, the larger the margin.
Training Objective
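• Putting this together (stated here for completeness, using the notation of the decision function above, with t(i) = –1 for negative instances and t(i) = +1 for positive instances):
• Hard margin linear SVM objective: minimize (1/2) w⊺w over w and b, subject to t(i)(w⊺x(i) + b) ≥ 1 for every training instance i.
• Soft margin linear SVM objective: introduce a slack variable ζ(i) ≥ 0 for each instance and minimize (1/2) w⊺w + C Σ ζ(i) over w, b, and ζ, subject to t(i)(w⊺x(i) + b) ≥ 1 – ζ(i); the hyperparameter C trades off a wide street against margin violations.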
Quadratic Programming
• The hard margin and soft margin problems are both convex quadratic
optimization problems with linear constraints. Such problems are known
as Quadratic Programming (QP) problems.
• Many off-the-shelf solvers are available to solve QP problems by using a
variety of techniques.
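• As a sketch of what an off-the-shelf solver looks like in practice, the hard margin primal can be handed to a generic QP solver such as cvxopt (assumed to be installed; the formulation below is a standard one, not taken from the slides):

import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, t):
    # Decision variable p = [w_1, ..., w_n, b]; minimize (1/2) w^T w
    # subject to t_i (w^T x_i + b) >= 1 for every training instance.
    m, n = X.shape
    H = np.zeros((n + 1, n + 1))
    H[:n, :n] = np.eye(n)          # quadratic term acts on w only, not on b
    H[n, n] = 1e-9                 # tiny value keeps the KKT system non-singular
    f = np.zeros(n + 1)
    # cvxopt expects constraints as G p <= h, i.e. -t_i (x_i^T w + b) <= -1
    G = -t[:, None] * np.hstack([X, np.ones((m, 1))])
    h = -np.ones(m)
    sol = solvers.qp(matrix(H), matrix(f), matrix(G), matrix(h))
    p = np.array(sol["x"]).ravel()
    return p[:n], p[n]             # w, b

# Usage: X is an (m, n) float array, t holds class labels -1.0 or +1.0.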
Quadratic Programming