ML Module 3 2022


MACHINE LEARNING(CS30110)

6TH SEM CSE

Dr. Rojalina Priyadarshini


Associate Professor, Comp.Sc. & Engg.
Mail: [email protected]
7008730761, 9437937546
Module-3 Overview

▪ Bayes Theorem
▪ Naïve Bayes Algorithm
▪ Support Vector Machine
▪ Feature Selection
▪ Dimensionality Reduction
NAÏVE BAYES

▪ Naïve Bayes is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
▪ The Naïve Bayes classifier is one of the simplest and most effective classification algorithms.
▪ It is an eager learner: the model is built at training time, so predictions are quick and can be made in real time.
▪ It is a probabilistic classifier.
▪ Examples: spam filtering, sentiment analysis, and classifying articles.

Terminologies :
▪ Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Each feature contributes individually to identifying it as an apple, without depending on the others.
▪ Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
NAÏVE BAYES EXAMPLE

▪ Bayes' theorem is based on conditional probability.
▪ Conditional probability: the probability of one (or more) events given the occurrence of another event.
▪ We are trying to find the probability of event A, given that event B is true:
P(A|B) = P(B|A) * P(A) / P(B)
▪ Here P(A) is called the prior probability, i.e., the probability of the event before the evidence is seen. P(B) is the probability of the evidence.
▪ P(A|B) is called the posterior probability, i.e., the probability of the event after the evidence is seen.
▪ For classification the same rule reads P(Y|X) = P(X|Y) * P(Y) / P(X), where
Y: Class of the variable
X: Feature vector (of size n)
BAYES' THEOREM

Assumptions of Naive Bayes:

· All the features are independent of one another, given the class.
· All the predictors have an equal effect on the outcome.

Problem:

From the given dataset, predict whether we can pet an animal or not.

Solution:

We need to find P(xi|yj) for each xi in X and each yj in Y.


Finding the frequency table of each independent feature


We also need the class probabilities P(y), which are calculated in the table below.

Now suppose we send our test data, test = (Cow, Medium, Black).

Probability of petting an animal:
We know P(Yes|Test) + P(No|Test) = 1, so we normalize the result by dividing each score by the sum of the two scores.

Here P(Yes|Test) > P(No|Test), so the prediction is “Yes”: we can pet this animal.
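The calculation above (frequency tables, class priors, product of likelihoods, then normalization) can be sketched in a few lines of Python. The six-row dataset below is a made-up toy example, not the dataset from the slides, and the function names `train` and `posterior` are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical toy data (NOT the slide's dataset): (animal, size, color, can_pet)
rows = [
    ("Dog", "Medium", "Black", "Yes"),
    ("Cow", "Medium", "Black", "Yes"),
    ("Cow", "Big", "Brown", "Yes"),
    ("Cow", "Small", "White", "No"),
    ("Rat", "Small", "Black", "No"),
    ("Rat", "Medium", "White", "No"),
]

def train(rows):
    """Build the frequency tables: class counts and per-feature value counts."""
    class_counts = Counter(row[-1] for row in rows)
    feature_counts = defaultdict(Counter)  # key: (feature index, class label)
    for *features, label in rows:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def posterior(test, class_counts, feature_counts):
    """Return normalized P(class | test) under the naive independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(y)
        for i, value in enumerate(test):
            score *= feature_counts[(i, label)][value] / n_label  # P(xi | y)
        scores[label] = score
    norm = sum(scores.values())  # assumes at least one class has a non-zero score
    return {label: s / norm for label, s in scores.items()}

class_counts, feature_counts = train(rows)
result = posterior(("Cow", "Medium", "Black"), class_counts, feature_counts)
prediction = max(result, key=result.get)
print(result, prediction)
```

On this toy data the unnormalized scores are 4/27 for “Yes” and 1/54 for “No”, so the normalized probability of “Yes” is 8/9.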
Naïve Bayes Algorithm

▪ We have a training data set of weather conditions and the corresponding target variable ‘Play’ (indicating whether play is possible).
▪ We need to classify whether players will play or not based on the weather condition.

STEPS
1. Convert the data set into a frequency table.
2. Create a likelihood table by finding the probabilities.
▪ For example, the Overcast probability is 0.29 and the probability of playing is 0.64.
3. Now, use the Naive Bayes equation to calculate the conditional probability for each class. The class with the highest conditional probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than P(No | Sunny) = 0.40, so the statement is correct.
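As a quick check of the arithmetic in the sunny-weather problem, the same Bayes' rule calculation can be done with exact fractions, using only the counts quoted above (3/9, 9/14, 5/14). The variable names are illustrative.

```python
from fractions import Fraction

# Counts quoted in the worked problem above
p_sunny_given_yes = Fraction(3, 9)   # P(Sunny | Yes)
p_yes = Fraction(9, 14)              # P(Yes)
p_sunny = Fraction(5, 14)            # P(Sunny)

# Bayes' rule: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny

print(p_yes_given_sunny, float(p_yes_given_sunny))  # 3/5 0.6
```

Working with exact fractions avoids the small drift that comes from multiplying the rounded values 0.33, 0.64 and 0.36.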
Naïve Bayes Algorithm

Pros:
• It is an eager learning algorithm. It is easy and fast to predict the class of a test data set.
• It performs well in multi-class prediction.
• When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and it needs less training data.
• It performs well with categorical input variables compared to numerical variable(s).

Cons:
• If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a 0 (zero) probability and will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.
• Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors which are completely independent.
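The “Zero Frequency” fix mentioned above can be illustrated with a minimal sketch of Laplace (add-one) estimation. The function name and the example counts are assumptions made for illustration; `alpha=1` is the usual add-one choice.

```python
def smoothed_likelihood(feature_count, class_count, n_categories, alpha=1.0):
    """Laplace-smoothed estimate of P(feature value | class).

    feature_count: times the value was seen together with the class
    class_count:   total training rows of that class
    n_categories:  number of distinct values the feature can take
    alpha:         smoothing strength (1.0 = add-one smoothing)
    """
    return (feature_count + alpha) / (class_count + alpha * n_categories)

# A value never seen with the class no longer gets probability zero:
unseen = smoothed_likelihood(0, 9, 3)   # (0 + 1) / (9 + 3)
seen = smoothed_likelihood(3, 9, 3)     # (3 + 1) / (9 + 3)
print(unseen, seen)
```

Because every category receives the same extra pseudo-count, the smoothed estimates for one feature and class still sum to 1.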
Support Vector Machine

▪ Support vector machines (SVMs) are a type of supervised machine learning algorithm.
▪ They can be used for both classification and regression.
▪ Generally, however, they are used for classification problems.
▪ In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate.
▪ Then, we perform classification by finding the hyperplane that best separates the two classes.
Linear Classifiers
f(x, w, b) = sign(w · x + b)
Points with w · x + b > 0 are labelled +1; points with w · x + b < 0 are labelled -1.

How would you classify this data?
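The linear decision rule f(x, w, b) = sign(w · x + b) can be written as a tiny function. The weights, bias and points below are made-up values for illustration, and sending a score of exactly 0 to the -1 class is an arbitrary choice of this sketch.

```python
def linear_classify(x, w, b):
    """Return +1 if w . x + b > 0, otherwise -1 (ties go to -1 by assumption)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

# Hypothetical 2-D example: the boundary is the line x1 - x2 = 0
w, b = [1.0, -1.0], 0.0
print(linear_classify([2.0, 1.0], w, b))   # score = 1.0
print(linear_classify([1.0, 3.0], w, b))   # score = -2.0
```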
Any of these separating lines would be fine, but which is best?
A carelessly placed boundary can leave a point misclassified to the +1 class.
Classifier Margin
f(x, w, b) = sign(w · x + b)
▪ A hyperplane can be thought of as a line that linearly separates the two classes; it is the decision boundary.
▪ Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x + b)
1. A hyperplane with the largest possible margin is likely to be a good classifier.
2. This implies that only the support vectors are important; other training examples are ignorable.
3. Empirically, it works very well.

▪ Support vectors are the datapoints that are closest to the hyperplane. If they are removed, the position of the hyperplane may change, so they are critical.
▪ The maximal margin hyperplane is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
Terminologies: SVM

•Support Vectors − The datapoints closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.

•Hyperplane − A decision plane or space that divides a set of objects having different classes.

•Margin − The gap between the two lines drawn through the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.
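The perpendicular distance mentioned in the margin definition is |w · x + b| / ||w||. A minimal sketch, assuming a hyperplane given by hypothetical values of w and b and a handful of made-up points: the points at the smallest distance play the role of support vectors.

```python
import math

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from point x to the hyperplane w . x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0 and three sample points
w, b = [3.0, 4.0], -5.0
points = [[3.0, 4.0], [1.0, 3.0], [0.0, 0.0]]

distances = [distance_to_hyperplane(p, w, b) for p in points]
closest = min(distances)  # distance from the boundary to the nearest point
print(distances, closest)
```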
Types of SVM

•There are two types of SVM:
•Linear
•Non-Linear

In the scenario below, we can’t have a linear hyperplane between the two classes, so how does SVM classify them?

SVM solves this problem by introducing an additional feature. Here, we add a new feature z = x^2 + y^2 and plot the data points on the x and z axes.

In this plot, the points to consider are:
•All values of z are non-negative, because z is the sum of the squares of x and y.
•In the original plot, the red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z.
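The effect of the added feature z = x^2 + y^2 can be seen on a small made-up dataset: circles near the origin, stars farther out. The coordinates below are hypothetical; the point is only that a single threshold on z separates classes that no straight line in the (x, y) plane could.

```python
def add_radial_feature(point):
    """Map (x, y) to the new feature z = x^2 + y^2."""
    x, y = point
    return x * x + y * y

# Hypothetical points: circles close to the origin, stars surrounding them
circles = [(0.5, 0.2), (-0.3, 0.4), (0.1, -0.6)]
stars = [(2.0, 1.0), (-1.5, 2.0), (1.0, -2.5)]

z_circles = [add_radial_feature(p) for p in circles]
z_stars = [add_radial_feature(p) for p in stars]

# Any threshold between the two groups is a linear separator along the z axis
threshold = (max(z_circles) + min(z_stars)) / 2
print(z_circles, z_stars, threshold)
```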
Kernels in SVM

What is a Kernel:

▪ A kernel transforms the input data space into the required form.
▪ A kernel takes a low-dimensional input space and transforms it into a higher-dimensional space.
▪ A kernel converts non-separable problems into separable problems by adding more dimensions.
Linear Kernel:

▪ Recommended for text classification, because most of these classification problems are linearly separable.
▪ The linear kernel works really well when there are a lot of features, and text classification problems have a lot of features.
▪ Linear kernel functions are faster than most of the others, and there are fewer parameters to optimize.
f(X) = w^T * X + b
▪ w is the weight vector whose norm is minimized during training, X is the data that you're trying to classify, and b is the bias, a linear coefficient estimated from the training data.
Polynomial Kernel:

▪ The polynomial kernel isn't used in practice very often because it isn't as computationally efficient as other kernels.
▪ A common form is f(X1, X2) = (X1^T * X2 + c)^d.
▪ f(X1, X2) represents the polynomial decision boundary that will separate your data. X1 and X2 represent your data, c is a constant offset, and ‘d’ is the degree of the polynomial.

Gaussian Radial Basis Function (RBF)

▪ One of the most powerful and commonly used kernels in SVMs. Usually the choice for non-linear data.
▪ It is a general-purpose kernel, used when there is no prior knowledge about the data. The equation is:
f(X1, X2) = exp(-gamma * ||X1 - X2||^2)
▪ gamma is a parameter that must be specified to the learning algorithm. A commonly quoted starting value is gamma = 0.1.
▪ ||X1 - X2|| is the Euclidean distance between the two feature vectors.
▪ The same kernel is also called the Gaussian kernel; in that form it is written with a width sigma, where gamma = 1 / (2 * sigma^2).
Sigmoid
▪ More useful in neural networks than in support vector machines, but there are occasional specific use cases.
f(X, y) = tanh(alpha * X^T * y + C)
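The four kernels discussed above can be compared side by side with a small pure-Python sketch. The function names, the default parameter values (c, d, gamma, alpha) and the two sample vectors are assumptions chosen for illustration, not values prescribed by the slides.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x1, x2):
    return dot(x1, x2)

def polynomial_kernel(x1, x2, c=1.0, d=2):
    # (x1 . x2 + c)^d, with constant offset c and degree d
    return (dot(x1, x2) + c) ** d

def rbf_kernel(x1, x2, gamma=0.1):
    # exp(-gamma * squared Euclidean distance)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x1, x2, alpha=0.01, c=0.0):
    # tanh(alpha * x1 . x2 + c)
    return math.tanh(alpha * dot(x1, x2) + c)

x1, x2 = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x1, x2))      # dot product is 11
print(polynomial_kernel(x1, x2))  # (11 + 1)^2
print(rbf_kernel(x1, x2))         # exp(-0.1 * 8)
print(sigmoid_kernel(x1, x2))     # tanh(0.11)
```

Note that the RBF kernel of a point with itself is always 1, and it decays towards 0 as the two points move apart.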
Why SVMs are so popular?

▪ SVMs have a built-in way to limit overfitting: they maximize the margin.
▪ We can work with a relatively large number of features without doing too much computation.
▪ Computation time depends on the number of support vectors, which is usually small.
PRINCIPAL COMPONENT ANALYSIS

▪ Principal Component Analysis (PCA) is an unsupervised learning algorithm used for dimensionality reduction in machine learning.

▪ It can be thought of as a projection method where data with m columns (features) is projected into a subspace with m or fewer columns, while retaining the essence of the original data.

▪ PCA is an operation applied to a dataset, represented by an n x m matrix A, that results in a projection of A which we will call B. Let’s walk through the steps of this operation.

    a11, a12
A = a21, a22
    a31, a32

B = PCA(A)
PRINCIPAL COMPONENT ANALYSIS

Step 1) Calculate the mean value of each column.
M = mean(A), i.e.
M(m11, m12) = ((a11 + a21 + a31) / 3, (a12 + a22 + a32) / 3)

Step 2) Center the values in each column by subtracting the column mean.
C = A - M

Step 3) Calculate the covariance matrix of the centered matrix C.
V = cov(C)

Step 4) Calculate the eigendecomposition of the covariance matrix V. This results in a list of eigenvalues and a list of eigenvectors.
values, vectors = eig(V)

Step 5) Sort the eigenvalues in descending order and select the top k components.

▪ If all eigenvalues have a similar value, then the existing representation may already be reasonably compressed or dense and the projection is not required.
▪ If some eigenvalues are close to zero, they represent components or axes of B that may be discarded.
▪ Ideally, we select the k eigenvectors, called principal components, that have the k largest eigenvalues.
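The five steps above can be traced end to end on a 3 x 2 matrix shaped like A. This is a minimal pure-Python sketch under stated assumptions: the numeric values of A are made up, the helper names are illustrative, the covariance uses the sample (n - 1) denominator, and the eigendecomposition uses the closed form for a symmetric 2 x 2 matrix, so it only works for two-column data. The final projection B = C · vectors is the usual last step of PCA and is added here for completeness.

```python
import math

def column_means(A):
    n = len(A)
    return [sum(row[j] for row in A) / n for j in range(len(A[0]))]

def center(A, M):
    return [[row[j] - M[j] for j in range(len(M))] for row in A]

def covariance(C):
    # Sample covariance matrix of already-centered data (n - 1 denominator)
    n, m = len(C), len(C[0])
    return [[sum(row[i] * row[j] for row in C) / (n - 1) for j in range(m)]
            for i in range(m)]

def eig_sym_2x2(V):
    """Eigenpairs of a symmetric 2x2 matrix, sorted by eigenvalue, largest first."""
    a, b, d = V[0][0], V[0][1], V[1][1]
    if b == 0:
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
    else:
        mid = (a + d) / 2
        rad = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
        pairs = []
        for lam in (mid + rad, mid - rad):
            x, y = b, lam - a            # solves (V - lam*I) v = 0
            norm = math.hypot(x, y)
            pairs.append((lam, (x / norm, y / norm)))
    return sorted(pairs, key=lambda p: p[0], reverse=True)

def pca(A, k):
    M = column_means(A)                   # Step 1
    C = center(A, M)                      # Step 2
    V = covariance(C)                     # Step 3
    pairs = eig_sym_2x2(V)                # Step 4
    top = pairs[:k]                       # Step 5
    B = [[sum(c * v for c, v in zip(row, vec)) for _, vec in top] for row in C]
    return [lam for lam, _ in pairs], B

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical values
values, B = pca(A, k=1)
print(values)
print(B)
```

For this A the two columns are perfectly correlated, so one eigenvalue is zero and a single component keeps all of the variance.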
Business Applications of K-Means Clustering
• Clustering is a very powerful technique with broad applications across industries, from media and healthcare to manufacturing and services.
1. Customer Segmentation
Customers are grouped by clustering algorithms according to their purchasing behavior or interests in order to develop focused marketing campaigns.
Imagine you have 10M customers and want to develop customized or focused marketing campaigns. It is unlikely that you will develop 10M marketing campaigns, so what do we do? We could use clustering to group the 10M customers into 25 clusters and then design 25 marketing campaigns instead of 10M.
