ML Unit-4
Dr. T.Bhaskar
Associate Professor
Google-Site: https://sites.google.com/view/bhaskart/ug-notes/machine-learning
Moodle-Site: https://proftbhaskar.gnomio.com/course/view.php?id=5 (Log in as Guest)
ML YouTube Playlist: https://tinyurl.com/ML-DrBhaskarT
Contents
• Bayes' Theorem, Naïve Bayes Classifiers, Naïve Bayes in scikit-learn: Bernoulli
Naïve Bayes, Multinomial Naïve Bayes, and Gaussian Naïve Bayes.
• In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class
is unrelated to the presence of any other feature
• For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter.
• Even if these features depend on each other or upon the existence of the other features, all of these
properties independently contribute to the probability that this fruit is an apple and that is why it is known
as ‘Naive’.
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
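These quantities are related by Bayes' theorem, where P(x|c) is the likelihood of the predictor given the class, P(c) is the class prior probability, and P(x) is the prior probability of the predictor:

P(c|x) = P(x|c) · P(c) / P(x)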
• Let’s understand it using an example. Below I have a training data set of weather and
corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify
whether players will play or not based on the weather conditions.
• Here we have P(Sunny|Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
• Now, P(Yes|Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which is the higher posterior probability, so the prediction is that play will happen.
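As a quick check, this posterior can be reproduced directly from the counts quoted above (a minimal Python sketch; the counts are taken from the worked example, not computed from the full table):

# P(Yes | Sunny) from the counts in the worked example
p_sunny_given_yes = 3 / 9    # P(Sunny | Yes)
p_yes = 9 / 14               # P(Yes)
p_sunny = 5 / 14             # P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6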
• Naive Bayes uses a similar method to predict the probability of different classes based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple
classes.
• Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based
on the currently existing objects.
• Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case
(which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED.
• In the Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based
on previous experience, in this case the percentage of GREEN and RED objects, and often used to
predict outcomes before they actually happen.
• In the Bayesian analysis, the final classification is produced by combining both sources of
information, i.e., the prior and the likelihood, to form a posterior probability using the so-
called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).
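In symbols, for a new case X and a candidate class Cj:

P(Cj|X) ∝ P(X|Cj) · P(Cj), i.e., posterior ∝ likelihood × prior.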
Using Bayes' rule above, we label a new case X with the class label Cj
that achieves the highest posterior probability.
• Naive Bayes in scikit-learn: scikit-learn implements three naive Bayes variants, each
based on a different probability distribution:
• The first one is a binary distribution, useful when a feature can be present or absent.
• The second one is a discrete distribution and is used whenever a feature must be
represented by a whole number (for example, in natural language processing, it can be
the frequency of a term),
• while the third is a continuous distribution characterized by its mean and variance.
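These three variants correspond to BernoulliNB, MultinomialNB, and GaussianNB in sklearn.naive_bayes. The sketch below shows how each is instantiated and fitted; the toy arrays are illustrative assumptions, not data from the slides:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# binary features (present/absent) for BernoulliNB
Xb = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
# non-negative count features (e.g. term frequencies) for MultinomialNB
Xm = np.array([[3, 0, 1], [0, 2, 4], [5, 1, 0], [0, 0, 2]])
# continuous features (modelled by mean and variance) for GaussianNB
Xg = np.array([[1.2, 0.7], [0.3, 2.1], [1.8, 0.2], [0.1, 1.9]])
y = np.array([1, 0, 1, 0])

BernoulliNB().fit(Xb, y)              # binary distribution
MultinomialNB(alpha=1.0).fit(Xm, y)   # discrete (count) distribution
GaussianNB().fit(Xg, y)               # continuous (Gaussian) distribution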
• The smoothing parameter alpha can be assigned any non-negative value; however, larger
values will assign higher probabilities to the missing features, and this choice could
alter the stability of the model. In our example, we're going to consider the default
value of 1.0.
>>> import numpy as np
>>> from sklearn.feature_extraction import DictVectorizer
>>> data = [
...     {'house': 100, 'street': 50, 'shop': 25, 'car': 100, 'tree': 20},
...     {'house': 5, 'street': 5, 'shop': 0, 'car': 10, 'tree': 500, 'river': 1}
... ]
>>> dv = DictVectorizer(sparse=False)
>>> X = dv.fit_transform(data)
>>> Y = np.array([1, 0])
>>> X
array([[ 100., 100.,   0.,  25.,  50.,  20.],
       [  10.,   5.,   1.,   0.,   5., 500.]])
• Note that the term 'river' is missing from the first set, so it's useful to keep alpha equal to 1.0 to
give it a small probability. The output classes are 1 for city and 0 for the countryside.
>>> from sklearn.naive_bayes import MultinomialNB
>>> mnb = MultinomialNB(alpha=1.0)
>>> mnb.fit(X, Y)
• To test the model, we create a dummy city with a river and a dummy countryside place without
any river:
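A minimal sketch of this test (the feature counts below are illustrative assumptions, not values from the slides; dv.transform is used so the feature ordering learned during fitting is preserved). The expected prediction is city (1) for the first place and countryside (0) for the second:

>>> test_data = [
...     {'house': 80, 'street': 20, 'shop': 15, 'car': 70, 'tree': 10, 'river': 1},
...     {'house': 10, 'street': 5, 'shop': 1, 'car': 8, 'tree': 300, 'river': 0}
... ]
>>> mnb.predict(dv.transform(test_data))
array([1, 0])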
# plotting scatter
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
Now let's train the classifier using our training data. Before training, we need to import the cancer dataset
from a CSV file; we will train on two of its features.
# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
An excerpt of the two selected feature columns and of the corresponding binary target array:

 [[ 108.3   858.1 ]
  [ 140.1  1265.  ]
  [  47.92  181.  ]]
 array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
         0., 0., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 1., ..., 1.])
• Now we'll fit a Support Vector Machine classifier to these points. While the
mathematical details of the likelihood model are interesting, we'll leave those for
reading elsewhere. Instead, we'll just treat the scikit-learn algorithm as a black box
which accomplishes the above task.
from sklearn.svm import SVC

clf = SVC(kernel='linear')
clf.fit(X, Y)
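Putting the pieces together, here is a minimal end-to-end sketch. Since the CSV file used in the lecture is not reproduced here, it loads the breast cancer dataset bundled with scikit-learn and picks two of its features (an assumption made purely for illustration):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

# load the bundled breast cancer data and keep only two features
cancer = load_breast_cancer()
X = cancer.data[:, [2, 3]]   # 'mean perimeter' and 'mean area'
Y = cancer.target            # binary class labels (0/1)

# scatter plot of the two selected features, coloured by class
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
plt.xlabel(cancer.feature_names[2])
plt.ylabel(cancer.feature_names[3])
plt.show()

# fit a linear Support Vector Machine classifier to these points
clf = SVC(kernel='linear')
clf.fit(X, Y)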
For example, the quadratic kernel on x, z ∈ R^2:

k(x, z) = (x⊤z)^2
        = (x1 z1 + x2 z2)^2
        = x1^2 z1^2 + x2^2 z2^2 + 2 x1 x2 z1 z2
        = (x1^2, √2 x1 x2, x2^2)⊤ (z1^2, √2 z1 z2, z2^2)
        = φ(x)⊤ φ(z),   where φ(x) = (x1^2, √2 x1 x2, x2^2)
Kernel k(x, z) takes two inputs and gives their similarity in F space
φ:X→F
k : X × X → R, k(x, z) = φ(x)⊤φ(z)
There must exist a Hilbert Space F for which k defines a dot product
The kernel function k also defines the Kernel Matrix K over the data
K is a symmetric matrix
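A small numerical check of these statements, using the quadratic kernel and the explicit feature map φ from above (the toy points are arbitrary):

import numpy as np

def k(x, z):
    # quadratic kernel: k(x, z) = (x^T z)^2
    return float(x @ z) ** 2

def phi(x):
    # explicit feature map for the quadratic kernel in 2D
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

data = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])

# the kernel value equals the dot product in the feature space F
x, z = data[0], data[1]
print(k(x, z), phi(x) @ phi(z))            # both are 2.25

# kernel (Gram) matrix over the data: K[m, n] = k(x_m, x_n)
K = np.array([[k(xm, xn) for xn in data] for xm in data])
print(np.allclose(K, K.T))                 # True: K is symmetric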
Recall that the SVM dual problem has the constraints

    Σ_{n=1}^{N} αn yn = 0,   0 ≤ αn ≤ C ;   n = 1, . . . , N

Replacing xm⊤xn by φ(xm)⊤φ(xn) = k(xm, xn) = Kmn, where k(·, ·) is some suitable kernel function, the dual becomes:

    Maximize  LD(w, b, ξ, α, β) = Σ_{n=1}^{N} αn − (1/2) Σ_{m,n=1}^{N} αm αn ym yn Kmn

    subject to  Σ_{n=1}^{N} αn yn = 0,   0 ≤ αn ≤ C ;   n = 1, . . . , N
SVM now learns a linear separator in the kernel defined feature
space F
This corresponds to a non-linear separator in the original space X
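To make this concrete, here is a minimal sketch (the dataset and kernel choice are illustrative assumptions): an RBF-kernel SVC separates data that no linear boundary in the original 2-D space can:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric circles: not linearly separable in the original space X
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel='linear').fit(X, y)
rbf_clf = SVC(kernel='rbf', gamma='scale').fit(X, y)

print(linear_clf.score(X, y))   # roughly 0.5: a linear separator fails
print(rbf_clf.score(X, y))      # close to 1.0: non-linear separator in X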
• With real datasets, SVM can extract a very large number of support vectors to
increase accuracy, and that can slow down the whole process.
• Let's consider an example with a linear kernel and a simple dataset. In the
following figure, there's a scatter plot of our set:
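Since the figure itself is not reproduced here, the sketch below generates a comparable simple two-class set (an assumption) and shows how to inspect the support vectors kept by a linear-kernel SVM:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# a simple, almost linearly separable two-class set
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=1)

# scatter plot of the set, standing in for the missing figure
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='spring')
plt.show()

clf = SVC(kernel='linear')
clf.fit(X, y)
print(clf.n_support_)   # number of support vectors per class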
Text Books:
Sr. No. | Title of Book | Authors | Publication House
1 | Machine Learning Algorithms | Giuseppe Bonaccorso | Packt Publishing Limited

Reference Books:
Sr. No. | Title of Book | Authors | Publication House
1 | Introduction to Machine Learning | Ethem Alpaydin | PHI
2 | Machine Learning: The Art and Science of Algorithms that Make Sense of Data | Peter Flach | Cambridge University Press
3 | Machine Learning | Tom Mitchell | McGraw Hill Publication