
Machine Learning

(Unit 2 - Part 2)
In statistical modeling, a common problem is how to estimate the joint probability distribution for a dataset.
What is the EM Algorithm?
• The EM (Expectation-Maximization) algorithm was proposed in 1977 by Arthur Dempster, Nan Laird and Donald Rubin.
• It is used to find (local) maximum likelihood estimates of the parameters of a statistical model when latent variables are present or the data is missing or incomplete.
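To make the idea concrete, here is a minimal sketch (added for illustration, not from the original slides) of EM fitting a two-component 1-D Gaussian mixture with NumPy; the synthetic data and the starting values are assumptions chosen only for the demo.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (the component label is the hidden/latent variable).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])

# Initial guesses for mixing weights, means and standard deviations.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    resp = pi * gaussian(x[:, None], mu, sigma)          # shape (n, 2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities.
    n_k = resp.sum(axis=0)
    pi = n_k / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

print(pi, mu, sigma)   # should approach the true mixing weights, means and std devs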
Applications of the EM Algorithm
Data clustering in machine learning and computer vision
Used in natural language processing
Used in parameter estimation in mixture models and quantitative genetics
Used in psychometrics
Used in medical image reconstruction and structural engineering
Support Vector Machine
Support Vector Machine (SVM)
❑ SVM is based on statistical learning theory.
❑ Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
❑ SVM involves finding hyperplanes which segregate the data into classes.
❑ Support vectors are the data points that lie closest to the decision surface (or hyperplane).
❑ SVMs are very versatile and are capable of performing linear or nonlinear classification, regression, and outlier detection.
Two Class Problem: Linearly Separable Case

❑ Linearly separable binary sets (one class denoted +1, the other −1; Class 1 vs. Class 2 in the figure).
❑ Many decision boundaries can separate these two classes. Which one should we choose?
Classifier Margin

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.
Good Decision Boundary: Margin Should Be Large

f(x, w, b) = sign(wT x − b)

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a Linear SVM).

Support vectors are those data points that the margin pushes up against.
How Does it Work?
Identify the right hyper-plane (Scenario-1):

Thumb rule to identify the right hyper-plane: "Select the hyper-plane which segregates the two classes better."

In this scenario, hyper-plane "B" has performed this job excellently.
How Does it Work?
Identify the right hyper-plane (Scenario-2):

Maximizing the distance between the nearest data point (of either class) and the hyper-plane will help us decide the right hyper-plane.

This distance is called the Margin.
How Does it Work?
Identify the right hyper-plane (Scenario-2):

The margin for hyper-plane C is high compared to both A and B. Hence, we name the right hyper-plane as C.

Another reason for selecting the hyper-plane with the higher margin is robustness: a hyper-plane with a low margin has a higher chance of misclassification.
How Does it Work?
Identify the right hyper-plane (Scenario-3):

SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin.

Here, hyper-plane B has a classification error and A has classified all points correctly.

Therefore, the right hyper-plane is A.
How Does it Work?
Can we classify two classes (Scenario-4)?

The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum margin.

Hence, we can say that SVM classification is robust to outliers.
How Does it Work?
Find the hyper-plane to segregate the classes (Scenario-5):

SVM solves this problem by introducing an additional feature. Here, we add a new feature z = x^2 + y^2 and plot the data points on the x and z axes.

All values of z are always positive, because z is the squared sum of x and y.

In the original plot, the red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z.
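A tiny numerical sketch of this trick (illustrative only; the synthetic "circles" and "stars" below are assumptions made for the demo): points near the origin and points on a ring around it overlap in (x, y), but separate cleanly along the new feature z = x^2 + y^2.

import numpy as np

rng = np.random.default_rng(1)
# "Red circles": points close to the origin; "stars": points on a ring far from it.
circles = rng.normal(0.0, 0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
stars = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)]) + rng.normal(0, 0.2, (50, 2))

def z(points):
    # New feature z = x^2 + y^2 (always non-negative).
    return (points ** 2).sum(axis=1)

print("max z for circles:", z(circles).max())   # small values near the origin
print("min z for stars:  ", z(stars).min())     # larger values away from the origin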


What is SVM?
A support vector machine (SVM) is a supervised learning algorithm.

It can be used for classification and regression problems, as support vector classification (SVC) and support vector regression (SVR).

It is best used on smaller datasets, as it takes too long to process large ones.
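As a hedged illustration of how this is typically used in practice (not part of the original slides), a short scikit-learn sketch on a synthetic two-class dataset; the dataset and parameter values are assumptions chosen only for the demo.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small, roughly linearly separable toy problem (SVMs suit smaller datasets).
X, y = make_blobs(n_samples=200, centers=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = SVC(kernel="linear", C=1.0)      # C controls how soft the margin is
clf.fit(X_train, y_train)

print("number of support vectors:", clf.support_vectors_.shape[0])
print("test accuracy:", clf.score(X_test, y_test))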


The ideology behind SVM
SVM is based on the idea of finding a hyperplane that best separates the features into different domains.

Intuition development
A stalker is sending you emails, and you want to design a function (hyperplane) which will clearly differentiate the two cases, so that whenever you receive an email from the stalker it is classified as spam. The figures show two cases in which the hyperplane is drawn; which one will you pick, and why?
Terminologies used in SVM
The points closest to the hyperplane are called the support vector points, and the distances of these vectors from the hyperplane are called the margins.

Support vector points are critical in determining the hyperplane, because if the position of these vectors changes, the hyperplane's position is altered.

Technically, this hyperplane can also be called the margin-maximizing hyperplane.
Hyperplane (Decision surface)
The hyperplane is a function which is used to differentiate between features.

In 2-D, the function used to classify between features is a line, whereas the function used to classify the features in 3-D is called a plane.

Similarly, the function which classifies points in higher dimensions is called a hyperplane.
Hyperplane (Decision surface)
Let's say there are "m" dimensions; thus the equation of the hyperplane in the m-dimensional space can be given as:

wT X + b = W1X1 + W2X2 + … + WmXm + b = 0

where,
W = weight vector (W1, W2, W3, …, Wm)
b = bias term (W0)
X = input variables.
Hard margin SVM
Assume 3 hyperplanes, namely (π, π+, π−), such that 'π+' is parallel to 'π' passing through the support vectors on the positive side and 'π−' is parallel to 'π' passing through the support vectors on the negative side.

The equations of each hyperplane can then be written as:

π : wT x + b = 0
π+ : wT x + b = +1
π− : wT x + b = −1
Hard margin SVM
for the point X1 :

for the point X3 :

for the point X4 :

for the point X6 :


Hard margin SVM
Let's look at the constraints for points which are not correctly classified:

We can see that only if the points are linearly separable is our hyperplane able to distinguish between them; if any outlier is introduced, it is not able to separate them.

This type of SVM is therefore called a hard margin SVM.
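The two one-sided conditions above are commonly combined into the standard hard-margin formulation (stated here for completeness):

yi (wT xi + b) ≥ 1 for every training point (xi, yi), with yi ∈ {+1, −1}

and the optimal hyperplane minimizes (1/2)||w||2 subject to these constraints.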


Support Vector Kernels
❑ The linear classifier relies on an inner product between vectors: K(xi, xj) = xiT xj
❑ If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes: K(xi, xj) = φ(xi)T φ(xj)
❑ A kernel function is a function that corresponds to an inner product in some expanded feature space.
Why use kernels?
Make a non-separable problem separable.
Map data into a better representational space.
SVM Kernel Functions
❑ SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of the kernel is to take data as input and transform it into the required form.
❑ Different SVM algorithms use different types of kernel functions, for example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.
❑ The kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity with little computational cost, even in very high-dimensional spaces.
Support Vector Kernel
Types of kernels:
Linear Kernel

Polynomial Kernel

Radial Basis Function Kernel (RBF) / Gaussian Kernel


Kernel Functions
❑ Linear Kernel: K(X,Y) = XTY + c

❑ Polynomial kernel: K(X,Y) = (γ⋅XTY + r)d, γ > 0

❑ Radial basis function (RBF) Kernel: K(X,Y) = exp(−∥X−Y∥2 / 2σ2), which in simple form can be written as exp(−γ⋅∥X−Y∥2), γ > 0
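To make these formulas concrete, here is a small NumPy sketch of the three kernels (the parameter values c, gamma, r and d below are illustrative assumptions, not prescribed by the slides):

import numpy as np

def linear_kernel(x, y, c=0.0):
    # K(X, Y) = X^T Y + c
    return x @ y + c

def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=3):
    # K(X, Y) = (gamma * X^T Y + r)^d
    return (gamma * (x @ y) + r) ** d

def rbf_kernel(x, y, gamma=0.5):
    # K(X, Y) = exp(-gamma * ||X - Y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))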
Data representation using kernels
Pros of SVM
It is really effective in higher dimensions.

Effective when the number of features is greater than the number of training examples.

One of the best algorithms when the classes are separable.

The hyperplane is affected only by the support vectors, so outliers have less impact.

SVM is suited to extreme-case binary classification.
Cons of SVM
For larger datasets, it requires a large amount of time to process.

Does not perform well in the case of overlapping classes.

Selecting hyperparameters of the SVM that allow for sufficient generalization performance is difficult.

Selecting the appropriate kernel function can be tricky.
Applications
Definition
❑ Margin of Separation (d): the separation between the hyperplane and the closest
data point for a given weight vector w and bias b.
❑ Optimal Hyperplane (maximal margin): the particular hyperplane for which the
margin of separation d is maximized.
❑ Thus, this can be written as:
wT xi + b ≥ 0 for di = +1
wT xi + b < 0 for di = –1
Contents
❑ Non-Linear SVM
▪ Non-Linear SVM: Feature Space
▪ Transformation to Feature Space
❑ SVM kernel functions
❑ Applications
Non-Linear SVM

❑ The idea is to gain linear separability by mapping the data to a higher dimensional space.

❑ Datasets that are linearly separable (with some noise) work out great.

❑ But what are we going to do if the dataset is just too hard?

❑ How about … mapping the data to a higher-dimensional space?

[Figure: 1-D data along the x axis that cannot be separated at 0 becomes separable after mapping x → (x, x2).]
Non-Linear SVM: Feature Space
❑ General idea: the original input space (x) can be mapped to some higher-dimensional feature space (φ(x)) where the training set is separable:

Φ: x → φ(x), e.g. x = (x1, x2) → φ(x) = (x12, √2·x1x2, x22)
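A quick numerical check of this mapping (an illustrative sketch, assuming the φ shown above): the ordinary dot product in the feature space equals the squared dot product in the input space, i.e. the degree-2 polynomial kernel K(a, b) = (aT b)2.

import numpy as np

def phi(x):
    # Explicit feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a, b = np.array([1.0, 3.0]), np.array([2.0, -1.0])
lhs = phi(a) @ phi(b)        # inner product computed in the feature space
rhs = (a @ b) ** 2           # kernel (a^T b)^2 computed in the original input space
print(lhs, rhs)              # both give the same value (here 1.0)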
Transformation to Feature Space
❑ Possible problems of the transformation:
▪ High computational burden due to the high dimensionality, and it is hard to get a good estimate.
❑ SVM solves these two issues simultaneously:
▪ "Kernel tricks" for efficient computation
▪ Minimizing ||w||2 can lead to a "good" classifier

[Figure: points in the input space are mapped by φ(·) into the feature space.]

Key idea: transform xi to a higher dimensional space.
How to calculate the distance from a point to a line?
The decision surface separating the classes is a hyperplane of the form:

wT x + b = 0

where x is the input vector, w is the normal vector, and b is the bias.

What is the distance expression for a point x to the line wT x + b = 0?
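The standard answer to this question: the perpendicular distance of a point x from the hyperplane wT x + b = 0 is

d = |wT x + b| / ||w||

For the canonical margin hyperplanes wT x + b = ±1 this gives a total margin width of 2 / ||w||, which is why maximizing the margin is equivalent to minimizing ||w||.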
Thank You
