
MACHINE LEARNING

• Prerequisite Courses, if any: Basics of Statistics, Linear Algebra, Calculus, Probability

• Companion Courses, if any: Artificial Intelligence, Deep Learning

• Course Objectives:
1. To understand the basic concepts of machine learning and apply them to various problems.
2. To learn the various types of machine learning and use them for various machine learning tasks.
3. To optimize machine learning models and generalize them.

• Course Outcomes:
On completion of the course, students will be able to:
CO1: Apply basic concepts of machine learning and different types of machine learning algorithms.
CO2: Compare different types of classification models and their relevant applications.
CO3: Differentiate various regression techniques and evaluate their performance.
CO4: Illustrate the tree-based and probabilistic machine learning algorithms.
CO5: Identify different unsupervised learning algorithms for related real-world problems.
CO6: Apply fundamental concepts of ANN.
Unit II CLASSIFICATION

SYLLABUS

Binary Classification: Linear Classification model, Performance Evaluation – Confusion Matrix, Accuracy, Precision, Recall, ROC Curves, F-Measure

Multi-class Classification: Model, Performance Evaluation Metrics – Per-Class Precision and Per-Class Recall, weighted average precision and recall with example, Handling more than two classes, Multiclass Classification techniques – One vs One, One vs Rest

Linear Models: Introduction, Linear Support Vector Machines (SVM) – Introduction, Soft Margin SVM, Introduction to various SVM kernels to handle non-linear data – RBF, Gaussian, Polynomial, Sigmoid

Logistic Regression – Model, Cost Function


Binary Classification: Linear Classification Model
Linear Classification model

• Classification: the process of recognizing, understanding, and grouping objects and ideas into preset categories, a.k.a. "sub-populations."
• It can be performed on both structured and unstructured data.
• The process starts with predicting the class of given data points.
• The classes are often referred to as targets, labels or categories.

• Classification algorithms applied to the training data find the same patterns (similar number sequences, words or sentiments, and the like) in future data sets.

• Sentiment analysis is used for categorizing unstructured text by opinion polarity (positive, negative or neutral).
Binary Classification: Linear Classification Model
Classification Terminologies In Machine Learning
• Classifier – An algorithm that is used to map the input data to a specific category.

• Classification Model – The model draws conclusions from the input data given for training; it predicts the class or category for new data.

• Feature – An individual measurable property of the phenomenon being observed.

• Binary Classification – Classification with two outcomes, e.g. either true or false.

• Multi-Class Classification – Classification with more than two classes; in multi-class classification each sample is assigned to one and only one label or target.

• Multi-label Classification – A type of classification where each sample is assigned to a set of labels or targets.
Binary Classification: Linear Classification Model
• Initialize – Choose the classifier to be used for the problem.

• Train the Classifier – Each classifier in scikit-learn uses the fit(X, y) method to fit the model on the training data X and training labels y.

• Predict the Target – For an unlabeled observation X, the predict(X) method returns the predicted label y.

• Evaluate – Evaluate the model, e.g. with a classification report, accuracy score, etc. (see the end-to-end sketch below).
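A minimal sketch of this initialize/train/predict/evaluate workflow in scikit-learn; the logistic-regression classifier and the built-in toy dataset are illustrative choices, not prescribed by the slides:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load a built-in binary classification dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=5000)       # Initialize
clf.fit(X_train, y_train)                     # Train the classifier
y_pred = clf.predict(X_test)                  # Predict the target
print(classification_report(y_test, y_pred))  # Evaluate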
Binary Classification: Linear Classification Model
Linear separability: A dataset is linearly separable if there is at least one line that clearly distinguishes the classes.

Non-linear separability: A dataset is said to be non-linearly separable if there isn't a single line that clearly distinguishes the classes.
Binary Classification: Linear Classification Model
• In a classification algorithm, a discrete output y is mapped from the input variable x:

y = f(x), where y is a categorical output

• Examples
 Email Spam Detector
 A handwritten character, classify it as one of the known characters.
 A patient diagnosed with a disease or not
Performance Evaluation: Confusion Matrix
• In any binary classification task, the model can achieve only two results: either the prediction is correct or it is incorrect, since there are only two classes.
Few Definitions
• The objects of interest in machine learning are usually referred to as instances.
• The set of all possible instances is called the instance space, denoted X.
• The label space is denoted L and the output space Y.
• Model: a mapping from the instance space to the output space.
• In classification the output space is a set of classes, while in regression it is the set of real numbers.
• In order to learn such a model we require a training set Tr of labelled instances (x, l(x)), also called examples, where l : X → L is a labelling function.
• Some of the labelled data is usually set aside for evaluating or testing a classifier, in which case it is called a test set and denoted Te. We use superscripts to restrict the training or test set to a particular class:
e.g., Te⊕ = {(x, l(x)) | x ∈ Te, l(x) = ⊕}
is the set of positive test examples, and Te⊖ is the set of negative test examples.

• Features are also called attributes, predictor variables, explanatory variables or independent variables. Indicating the set of values or domain of feature i by Fi,
X = F1 × F2 × ... × Fd, and thus every instance is a d-vector of feature values.
Classification
Input attributes: price as the first attribute x1 (e.g., in U.S. dollars) and engine power as the second attribute x2 (e.g., engine volume in cubic centimeters).

Thus we represent each car using two numeric values,

x = [x1, x2]^T

and its label r denotes its type:

r = 1 if x is a positive example
r = 0 if x is a negative example

Each car is represented by such an ordered pair (x, r), and the training set contains N such examples.
Classification
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)

h(x) = 1 if h classifies x as a positive example
h(x) = 0 if h classifies x as a negative example

In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x).

The empirical error is the proportion of training instances where the predictions of h do not match the required labels given in X. The error of hypothesis h given the training set X is

E(h | X) = Σ_{t=1}^{N} 1(h(x^t) ≠ r^t)

where the indicator 1(a ≠ b) is 1 if a ≠ b and 0 if a = b.
Classification
• A classifier is a mapping ĉ : X → C, where C = {C1, C2, ..., Ck} is a finite and usually small set of class labels. We use Ci to indicate the set of examples of that class.
• ĉ(x) is an estimate of the true but unknown function c(x). Examples for a classifier take the form (x, c(x)), where x ∈ X is an instance and c(x) is the true class of the instance. Learning a classifier involves constructing the function ĉ such that it matches c as closely as possible (and not just on the training set, but ideally on the entire instance space X). In the simplest case we have only two classes, which are usually referred to as positive and negative, ⊕ and ⊖, or +1 and −1. Two-class classification is often called binary classification (or concept learning, if the positive class can be meaningfully called the concept being learned).
Classification
Assessing classification performance
• contingency table or confusion matrix
Performance Evaluation: Confusion Matrix
Confusion Matrix
• Evaluation of the performance of a classification model is based on the counts
of test records correctly and incorrectly predicted by the model.
• The confusion matrix provides a more insightful picture.
• It not only shows the performance of a predictive model, but also which
classes are being predicted correctly and incorrectly, and what type of errors are
being made.
Performance Evaluation: Confusion Matrix
Confusion Matrix

True Positive:
Interpretation: You predicted positive and it’s true.
You predicted that a woman is pregnant and she actually is.
True Negative:
Interpretation: You predicted negative and it’s true.
You predicted that a man is not pregnant and he actually is not.
False Positive: (Type 1 Error)
Interpretation: You predicted positive and it’s false.
You predicted that a man is pregnant but he actually is not.
False Negative: (Type 2 Error)
Interpretation: You predicted negative and it’s false.
You predicted that a woman is not pregnant but she actually is.
Performance Evaluation: Confusion Matrix
Confusion Matrix
1. True Positive Rate (TPR) (Sensitivity or Recall)
TPR = TP / (TP + FN)
2. True Negative Rate (TNR) (Specificity)
TNR = TN / (TN + FP)
3. False Positive Rate
FPR = FP / (FP + TN)
4. False Negative Rate
FNR = FN / (FN + TP)
5. Positive Predictive Value (PPV) (Precision)
PPV = TP / (TP + FP)
6. Negative Predictive Value (NPV)
NPV = TN / (TN + FN)
7. False Discovery Rate
FDR = FP / (FP + TP)
8. False Omission Rate
FOR = FN / (FN + TN)
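A small sketch computing all eight rates from the four raw counts; the example counts passed at the bottom are arbitrary placeholders:

def confusion_rates(tp, fp, fn, tn):
    """Derive the standard rates from the four confusion-matrix counts."""
    return {
        "TPR (recall/sensitivity)": tp / (tp + fn),
        "TNR (specificity)":        tn / (tn + fp),
        "FPR":                      fp / (fp + tn),
        "FNR":                      fn / (fn + tp),
        "PPV (precision)":          tp / (tp + fp),
        "NPV":                      tn / (tn + fn),
        "FDR":                      fp / (fp + tp),
        "FOR":                      fn / (fn + tn),
    }

# Arbitrary example counts
for name, value in confusion_rates(tp=40, fp=10, fn=5, tn=45).items():
    print(f"{name}: {value:.3f}")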
Performance Evaluation: Confusion Matrix
Confusion Matrix

Accuracy and Error Rate

Accuracy:
acc = (TP + TN) / N
or
acc = (TP + TN) / (TP + FP + TN + FN)
The fraction of correctly classified examples.

Error Rate:
error = (FN + FP) / N = 1 − acc
The fraction of misclassified examples.
Performance Evaluation: Confusion Matrix
Recall versus Precision

Precision is the ratio of true positives to all the positives predicted by the model.
Low precision: the more false positives the model predicts, the lower the precision.

Recall (sensitivity) is the ratio of true positives to all the positives in your dataset.
Low recall: the more false negatives the model predicts, the lower the recall.
Performance Evaluation: Confusion Matrix
Recall versus Precision

In case 1, which scenario do you think will have the highest cost?

Imagine that we predict COVID-19-positive residents as healthy patients, so they do not need to quarantine: there would be a massive number of COVID-19 infections. The cost of false negatives is much higher than the cost of false positives.
Performance Evaluation: Confusion Matrix
Recall versus Precision

In case 2, which scenario do you think will have the highest cost?

Missing important emails will clearly be more of a problem than receiving spam, so in this case FP will have a higher cost than FN.
Performance Evaluation: Confusion Matrix
Recall versus Precision

In case 3, which scenario do you think will have the highest cost?

The banks would lose a large amount of money if actual bad loans were predicted as good loans, due to loans not being repaid. On the other hand, banks won't be able to make revenue if actual good loans are predicted as bad loans. Therefore, the cost of false negatives is much higher than the cost of false positives.
Performance Evaluation: Confusion Matrix

In practice, the cost of false negatives is not the same as the cost of false positives; it depends on the specific case. It is evident that we should not only calculate accuracy, but also evaluate our model using other metrics, for example recall and precision.
Performance Evaluation: Confusion Matrix
Ex 1. A dataset contains 33 spam and 67 ham mails. When the classifier is trained, it correctly predicts 27 spam mails and 57 ham mails. Create the confusion matrix and find accuracy, recall and precision.

Ex 2. A total of 10 cats and dogs, and our model predicts whether each one is a cat or not.
Actual values = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog']
Predicted values = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat']
Cat = positive, Dog = negative
Create the confusion matrix and find accuracy, recall and precision.

Ex 3. 165 patients were tested for the presence of a disease. Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times. In reality, 105 patients in the sample have the disease and 60 patients do not. Find the confusion matrix and all evaluation metrics.
                     Actual
                 Positive   Negative
Predicted
  Positive
  Negative
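A quick sketch for checking Ex 2 with scikit-learn; labels=['cat', 'dog'] orders the matrix so that 'cat' is the positive class, matching the slide's convention:

from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score

actual    = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog']
predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat']

# Rows are actual, columns are predicted; 'cat' listed first as the positive class
print(confusion_matrix(actual, predicted, labels=['cat', 'dog']))
print("accuracy :", accuracy_score(actual, predicted))
print("recall   :", recall_score(actual, predicted, pos_label='cat'))
print("precision:", precision_score(actual, predicted, pos_label='cat'))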
Performance Evaluation: F1-Score:

• Consider a fraud detection model giving the probability [0.0–1.0] that a transaction is fraudulent.
• If this probability is below 0.5, you classify the transaction as non-fraudulent; otherwise, you classify the transaction as fraudulent.
• To evaluate the model, consider 10,000 manually classified transactions, with 300 fraudulent transactions and 9,700 non-fraudulent transactions, and build the confusion matrix for this classifier.
Performance Evaluation: F1-Score:

Out of the 300 fraudulent transactions, only 100 fraudulent transactions are classified correctly. The classifier missed 200 out of the 300 fraudulent transactions!

• Another type of classifier:

Accuracy can be misleading for imbalanced data.
Performance Evaluation: F1-Score:
What percent of the positive (fraudulent) transactions were caught?

• The classifier caught 33.3% of the fraudulent transactions (recall = 100/300).

What percent of positive (fraudulent) predictions were correct?

• The classifier's positive predictions were correct 12.5% of the time (precision).

Performance Evaluation: F1-Score:
• The F1 score combines recall and precision into one performance metric.
• The F1 score is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• This score takes both false positives and false negatives into account.
• F1 is usually more useful than accuracy, especially if you have an uneven class distribution.

• Differences between the F1-score and accuracy:

• Accuracy is used when the true positives and true negatives are more important, while the F1-score is used when the false negatives and false positives are crucial.
• Accuracy can be used when the class distribution is similar, while the F1-score is a better metric when there are imbalanced classes, as in the above case.
• In most real-life classification problems an imbalanced class distribution exists, and thus the F1-score is a better metric to evaluate our model on.
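A sketch computing F1 for the fraud example above; the FP count of 700 is inferred from the stated 12.5% precision (100 correct out of 800 positive predictions), not given explicitly on the slide:

tp, fn, fp = 100, 200, 700   # fp inferred from precision = 100 / (100 + 700) = 12.5%

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.125 recall=0.333 f1=0.182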
Performance Evaluation: ROC Curve
• The overall performance of a classifier, summarized over all possible thresholds,
is given by the Receiver Operating Characteristics (ROC) curve.
• ROC Curves are used to see how well your classifier can separate positive and
negative examples and to identify the best threshold for separating them.
• e.g. After training the model the values of the middle column (True Label) are
either zero (0) for non-fraudulent transactions or one (1) for fraudulent
transactions, and the last column (Fraudulent Prob) is the probability that the
transaction is fraudulent:
Performance Evaluation: ROC Curve
• Consider 0.5 threshold

• Consider 0.1 threshold


Performance Evaluation: ROC Curve
• To derive the ROC curve, calculate the True Positive Rate (TPR) and the False Positive Rate (FPR), starting by setting the threshold to 1.0, where every transaction with a Fraudulent Prob of less than 1.0 is classified as non-fraudulent (0).
• The column "T=1.0" shows the predicted class labels when the threshold is 1.0:
Performance Evaluation: ROC Curve
• The confusion matrix for the Threshold=1.0 case:

The ROC curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings, so you calculate both:
Performance Evaluation: ROC Curve
• The confusion matrix for the Threshold=0.9 case:
Performance Evaluation: ROC Curve
• The ROC curve for the Threshold=0.9 case:
Performance Evaluation: ROC Curve
AUC (Area Under the Curve)
• The model performance is determined by looking at the area under the ROC curve (AUC).
• An excellent model has an AUC near 1.0, which means it has a good measure of separability.
• For the above model, the AUC is the combined area of the blue, green and purple rectangles, so AUC = 0.4 × 0.6 + 0.2 × 0.8 + 0.4 × 1.0 = 0.80.
Performance Evaluation: ROC
ROC (Receiver Operating Characteristic)
• ROC Curves summarize the trade-off between the true positive rate and false
positive rate for a predictive model using different probability thresholds.
• AUC-ROC curve is one of the most commonly used metrics to evaluate the
performance of machine learning algorithms.
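A minimal sketch of deriving an ROC curve and AUC with scikit-learn; the random labels and scores are stand-ins for the fraud probabilities above:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true  = rng.integers(0, 2, size=1000)                          # stand-in true labels
y_score = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0, 1)   # stand-in probabilities

# TPR/FPR at every threshold, plus the overall area under the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", roc_auc_score(y_true, y_score))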
Multi-class Classification: Model
Multiclass Classification:
• Classification means categorizing data and forming groups based on the
similarities.
• In multiclass classification, we have more than two classes.
Multi-class Classification: Model
Multiclass Classification Algorithms:
Multi-class Classification
Multiclass Classification Strategies:
OneVsOne: for N classes, N × (N − 1) / 2 binary classifier models are trained.
Example:
Consider three classes: blue, red and green. The one-vs-one approach splits the problem into the following 3 binary classification datasets:

N × (N − 1) / 2 = 3 × (3 − 1) / 2 = 3 binary classifiers have to be generated.
Multi-class Classification:
Multiclass Classification Strategies:
OneVsRest:
Multi-class Classification

Same setup as binary classification: we have a set of features for each example, but rather than just two labels, we now have 3 or more (e.g. apple, orange, banana, pineapple).
Multi-class Classification

It is hard to separate three classes with just one line. 
Multi-class Classification

OVA (one vs. all, a.k.a. one vs. rest): train one linear classifier (e.g. a perceptron) per label:

banana vs. not
pineapple vs. not
apple vs. not
Multi-class Classification

OVA: given the per-label linear classifiers (banana vs. not, pineapple vs. not, apple vs. not), how do we classify a new example? Predict the label whose classifier is most confident.
Multi-class Classification

Approach 2: All vs. all (AVA), or one vs. one

Training:
For each pair of labels, train a classifier to distinguish between them (see the scikit-learn sketch below):

for i = 1 to number of labels:
    for k = i+1 to number of labels:
        train a classifier to distinguish between label_i and label_k:
        - create a dataset with all examples with label_i labeled positive
          and all examples with label_k labeled negative
        - train a classifier on this subset of the data
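A hedged sketch of both strategies using scikit-learn's meta-estimators; the LinearSVC base classifier and the iris dataset are illustrative stand-ins:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)   # 3*(3-1)/2 = 3 classifiers
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)  # 3 classifiers, one per class

print(len(ovo.estimators_), len(ovr.estimators_))  # 3 3
print(ovo.predict(X[:5]), ovr.predict(X[:5]))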
Multi-class Classification
Macroaveraging vs. microaveraging

Microaveraging: average over examples (this is the "normal" way of calculating).

Macroaveraging: calculate the evaluation score (e.g. accuracy) for each label, then average over labels.
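A small sketch of micro, macro and weighted averaging with scikit-learn; the label arrays are made-up examples:

from sklearn.metrics import precision_score, recall_score

y_true = ['cat', 'cat', 'fish', 'hen', 'hen', 'fish', 'cat', 'hen']
y_pred = ['cat', 'hen', 'fish', 'hen', 'cat', 'fish', 'cat', 'hen']

# 'micro' averages over examples; 'macro' averages per-class scores;
# 'weighted' is the macro average weighted by class support
for avg in ('micro', 'macro', 'weighted'):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    print(f"{avg:8s} precision={p:.3f} recall={r:.3f}")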
Performance Evaluation Metrics: Confusion Matrix

Multiclass Classification Algorithms:

Example: Total number of instances = 25
Performance Evaluation Metrics: Per-Class Precision & Recall
Multiclass Classification Algorithms: Example: Total number of instances = 25

Precision for Cat: TP / (TP + FP) = 4 / (4 + 9) = 4/13 ≈ 0.308

Recall for Cat: TP / (TP + FN) = 4 / (4 + 2) = 4/6 ≈ 0.667
Performance Evaluation Metrics: Per-Class Precision & Recall
Multiclass Classification Algorithms: Example: Total number of instances = 25

Precision for Fish: 2 / (2 + 1) = 2/3 ≈ 0.667

Recall for Fish: 2 / (2 + 6 + 2) = 2/10 = 0.200
Performance Evaluation Metrics: Per-Class Precision & Recall

Multiclass Classification Algorithms: Example: Total number of instances = 25

Precision for Hen: TP / (TP + FP) = 6 / (6 + 3) = 6/9 ≈ 0.667

Recall for Hen: TP / (TP + FN) = 6 / (6 + 3) = 6/9 ≈ 0.667
Performance Evaluation Metrics: Per-Class Precision & Recall
Ex 1. Find precision and recall for every class.

                    Actual Class
                 C1    C2    C3
Predicted  C1    15     7     2
Class      C2     2    15     3
           C3     3     8    45
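A sketch for checking this exercise numerically; the matrix rows are predicted classes and the columns are actual classes, as laid out above:

import numpy as np

# Rows: predicted C1, C2, C3; columns: actual C1, C2, C3
cm = np.array([[15,  7,  2],
               [ 2, 15,  3],
               [ 3,  8, 45]])

for i, label in enumerate(['C1', 'C2', 'C3']):
    tp = cm[i, i]
    precision = tp / cm[i, :].sum()   # TP over everything predicted as this class
    recall    = tp / cm[:, i].sum()   # TP over everything actually in this class
    print(f"{label}: precision={precision:.3f} recall={recall:.3f}")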
Basic principles of classification
Basic principles of classification

• All objects before the coast line are boats and all objects after the coast line are houses.
• The coast line serves as a decision surface that separates the two classes.
Basic principles of classification

Then the algorithm seeks to find a decision surface that separates the classes of objects.
Basic principles of classification

Unseen (new) objects are classified as "boats" if they fall below the decision surface and as "houses" if they fall above it.
Basic principles of classification

These boats will be misclassified as houses


SVM
• Support vector machines (SVMs) are a binary classification algorithm.
• The Support Vector Machine (SVM) is a supervised learning algorithm developed by Vladimir Vapnik.
• SVM has successful applications in many complex, real-world problems such as text and image classification, handwriting recognition, data mining, bioinformatics, medicine, biosequence analysis and even the stock market.

Consider an example dataset described by 2 genes, gene X and gene Y. Represent patients geometrically (by "vectors").
SVM

Find a linear decision surface (“hyperplane”) that can separate patient classes and has the
largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support
vectors”);
SVM
1. Linear SVM – Hard Margin Classifier
Used for a perfectly separable dataset (linear classification); called the "Linear SVM – Hard Margin Classifier".

2. Linear SVM – Soft Margin Classifier
The concept of the hard margin classifier is extended to datasets with some outliers. In this case not all of the data points can be separated using a straight line; there will be some misclassified points.

3. Non-Linear SVM
Non-linear SVM using kernels.
SVM

• Support vectors are the samples closest to the


separating hyperplane.

gin
These are ar
M
Support d

d
Vectors xi

d
SVM

Separating Hyperplanes

For two classes (yi = +1 and yi = −1), which hyperplane should we choose?

Yes, there are many possible separating hyperplanes. It could be this one, or this, or this, or maybe...!
SVM
• The goal of a support vector machine is to find the optimal separating
hyperplane which maximizes the margin of the training data.

What is a separating hyperplane?

A hyperplane is a generalization of a plane:
in one dimension, a hyperplane is a point;
in two dimensions, it is a line;
in three dimensions, it is a plane;
in more dimensions you can call it a hyperplane.
SVM: What is the optimal separating hyperplane?

Select a hyperplane as far as possible from the data points of each category:

The objective of an SVM is to find the optimal separating hyperplane:

because it correctly classifies the training data, and
because it is the one which will generalize better with unseen data.
SVM: What is the margin, and how does it help to choose the optimal hyperplane?

If a hyperplane is very close to a data point, its margin will be small.
The further a hyperplane is from a data point, the larger its margin will be.
This means that the optimal hyperplane will be the one with the biggest margin.

That is why the objective of the SVM is to find the optimal separating hyperplane which maximizes the margin of the training data.
Support VECTOR Machine Mathematics Behind SVM
• Basics of VECTOR:
• Definition: A vector is an object that has both a magnitude and a direction

1) The magnitude
• The magnitude or length of a vector x is written ||x|| and is called its norm.

• e.g. for vector OA, ||OA|| is the length of the segment OA.

The distance OA can be computed using Pythagoras' theorem:

OA² = OB² + AB², so ||OA|| = √(OB² + AB²)
Support VECTOR Machine Mathematics Behind SVM
2) The direction
• The direction is the second component of a vector.
• Definition: The direction of a vector u(u1, u2) is the vector w(u1/||u||, u2/||u||).

For the vector u(u1, u2) with u1 = 3 and u2 = 4, we have ||u|| = 5.

Computing the direction vector:

cos(θ) = u1/||u|| = 3/5 = 0.6
and
cos(α) = u2/||u|| = 4/5 = 0.8

The direction of u(3, 4) is the vector w(0.6, 0.8).

3) The dot product

• If two vectors x and y have an angle θ (theta) between them, their dot product is: x⋅y = ||x|| ||y|| cos(θ)
Support VECTOR Machine Mathematics Behind SVM
4) The orthogonal projection of a vector
Given two vectors x and y, find the orthogonal projection of x onto y.

If the vector u is the direction of y, then u = y/||y||, and the vector z = (u⋅x)u is the orthogonal projection of x onto y.
SVM: Compute the margin of the hyperplane

Given a hyperplane which separates two groups of data, compute the distance between the point A(3, 4) and the hyperplane.

Treat the point A as a vector a from the origin to A, and take the projection of a on the unit vector u in the direction of w. The distance between the point A(3, 4) and the hyperplane is the same as ||p||, the norm of that projection.
SVM: Compute the margin of the hyperplane

1. w = (2, 1) is normal to the hyperplane, and a = (3, 4) is the vector between the origin and A.
||w|| = √(2² + 1²) = √5

2. Let the vector u be the direction of w:
u = (2/√5, 1/√5)

3. p is the orthogonal projection of a onto w, so
p = (u⋅a)u = (4, 2), and ||p|| = √(4² + 2²) = 2√5

Knowing the distance ||p|| between A and the hyperplane, the margin is defined by:
margin = 2||p||
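A quick numerical check of this projection computation with NumPy; nothing here is specific to SVMs, it just verifies the geometry:

import numpy as np

w = np.array([2.0, 1.0])   # normal to the hyperplane
a = np.array([3.0, 4.0])   # vector from the origin to point A

u = w / np.linalg.norm(w)  # unit vector in the direction of w
p = np.dot(u, a) * u       # orthogonal projection of a onto w

print(p, np.linalg.norm(p))            # [4. 2.] 4.472... (= 2*sqrt(5))
print("margin =", 2 * np.linalg.norm(p))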
SVM: Maximize the distance between the two hyperplanes

How do we find the distance between the two margin hyperplanes?
SVM

We have found the couple (w, b) for which ||w|| is the smallest possible while the constraints we fixed are met, which means we have the equation of the optimal hyperplane!

We find w and b by solving the following objective function using quadratic programming (the standard form is reproduced below).
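The objective-function image itself did not survive extraction; the standard hard-margin formulation these slides refer to is:

\min_{w,\, b} \; \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, N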
Linear SVM : Soft Margin Classifier
• An ideal SVM analysis should produce a hyperplane that completely separates
the vectors (cases) into two non-overlapping classes.
• However, perfect separation may not be possible, or it may result in a model
with so many cases that the model does not classify correctly.
• In this situation SVM finds the hyperplane that maximizes the margin and
minimizes the misclassifications.
The hard margin classifier won't work here due to the inequality constraint yi(wTxi + b) ≥ 1.
Linear SVM : Soft Margin Classifier

The algorithm tries to keep the slack variables at zero while maximizing the margin. However, it does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes.
Linear SVM : Soft Margin Classifier
The slack variable ξ indicates how much a point can violate the margin.
The slack variable helps to define 3 types of data points (the objective with slack is sketched below):

• If ξ = 0 then the corresponding point is on the margin or further away.

• If 0 < ξ < 1 then the point is within the margin but classified correctly (correct side of the hyperplane).
• If ξ ≥ 1 then the point is misclassified and lies on the wrong side of the hyperplane.
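For reference, the standard soft-margin objective these slides describe (the equation image itself was not preserved) is:

\min_{w,\, b,\, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0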

The hyperparameter C is also called the regularization constant.

• If C → 0, then the loss term vanishes and we are only trying to maximize the margin.
• If C → ∞, then the margin has no effect and the objective function tries to just minimize the loss.
• In other words, the hyperparameter C controls the relative weighting between the twin goals of making the margin large and ensuring that most examples have a functional margin of at least 1.
Non-Linear SVM

When it is almost impossible to separate non-linear classes with a line, we apply another trick, called the kernel trick, that helps handle such data.
Non-linear SVMs

 General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)
SVM

• If such a linear decision surface does not exist, the data is mapped into a much higher-dimensional space ("feature space") where the separating decision surface is found.
• The feature space is constructed via a very clever mathematical projection (the "kernel trick").
The “Kernel Trick”
• Kernel trick: the kernel function transforms the data into a higher-dimensional feature space to make it possible to perform the linear separation (standard kernel forms are listed below).

• Linear: The linear kernel does not transform the data at all. Therefore, it can be expressed simply as the dot product of the features.

• Polynomial of degree d: The polynomial kernel of degree d adds a simple non-linear transformation of the data.

• Gaussian (radial-basis function): The RBF kernel performs well on many types of data and is thought to be a reasonable starting point for many learning tasks.
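The kernel formulas themselves appear to have been images on the slides; the standard forms for the kernels named in the syllabus (linear, polynomial, Gaussian/RBF, sigmoid) are:

K(x, y) = x \cdot y                                        % linear
K(x, y) = (x \cdot y + c)^{d}                              % polynomial of degree d
K(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right)   % Gaussian / RBF
K(x, y) = \tanh(\gamma \, x \cdot y + c)                   % sigmoid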
The “Kernel Trick”

• Example polynomial kernel function:

Existing feature: X = np.array([-2, -1, 0, 1, 2])
Label: Y = np.array([1, 1, 0, 1, 1])

It is impossible for us to find a line to separate the yellow (1) and purple (0) dots (shown on the left).
Apply the transformation X² to get:
New feature: X = np.array([4, 1, 0, 1, 4])
By combining the existing and new features, we can certainly draw a line to separate the yellow and purple dots (shown on the right).
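A runnable sketch of this idea: stacking the original feature with its square makes the data linearly separable, so a linear SVM can fit it perfectly; the LinearSVC choice is illustrative:

import numpy as np
from sklearn.svm import LinearSVC

x = np.array([-2, -1, 0, 1, 2])
y = np.array([ 1,  1, 0, 1, 1])

# Combine the existing feature with the new feature x**2
X2 = np.column_stack([x, x**2])

clf = LinearSVC(max_iter=10000).fit(X2, y)
print(clf.predict(X2))   # matches y: the lifted data is linearly separable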
The “Kernel Trick”
Logistic Regression
What is Regression?
Logistic Regression
• Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.

• It uses a "black box" function to understand the relation between the categorical dependent variable and the independent variables.

• This black box function is popularly known as the softmax function (which, for two classes, reduces to the logistic/sigmoid function).


Logistic Regression
Ex. A shop owner would like to predict whether a customer who enters the shop will buy a MacBook or not. To predict this, the shop owner will observe customer features like:

Gender: probability-wise, males have a higher chance of purchasing a MacBook than females.

Age: kids are unlikely to purchase a MacBook.

• The logistic regression model passes the likelihood of occurrences through the logistic function to predict the corresponding target class.
• This popular logistic function is the softmax function.

The shop owner will use the above, similar kinds of features to predict the likelihood of the event occurring (what is the event here?).
Logistic Regression
Ex. X axis: age of a person. Y axis: whether the person has a smartphone or not.
A classification problem: given the age of a person, we have to predict whether he possesses a smartphone or not.

In such a classification problem, can we use linear regression?


Logistic Regression
Can we solve the above problem using linear regression?

• All the data points below a threshold will be classified as 0, i.e. those who do not have smartphones.
• Similarly, all the observations above the threshold will be classified as 1, which means these people have smartphones.
Case 1: Add a new data point on the extreme right of the plot, and suddenly the slope of the fitted line changes. Now we have to inadvertently change the threshold of our model.
Case 2: If we extend this line, it will give values above 1 and below 0. In our classification problem, we do not know what values greater than one and below 0 represent, so it is not a natural extension of the linear model.
Why not Linear Regression for Classification?

• Linear regression predicts continuous variables, like the price of a house, and the output of linear regression can range from negative infinity to positive infinity.
• The predicted value is not a probability but a continuous value for the classes, so it is very hard to find the right threshold that can help distinguish between the classes.
• Even if we figured out the right threshold for the binary-class problem, if the problem were multi-class it would not give the desirable prediction.
• In a multiclass problem there can be n classes, each labelled from 0 to n−1.
• Suppose we have a 5-class problem with classes 0, 1, 2, 3 and 4. These classes do not carry any meaningful order; however, linear regression would be forced to establish some kind of ordered relation between the dependent and independent features.
• Moreover, the dependent variable would be treated as a continuous number, and the best-fit line would pass through the mean of the points, giving an outcome in continuous values that may go below 0 and may exceed 4.
Logistic Regression
• All the problems mentioned above is tackled by the Logistic Regression.
• All the problems mentioned above are tackled by logistic regression.
• Logistic regression, instead of fitting the best-fit line, condenses the output of the linear function between 0 and 1 via p = 1 / (1 + e^-(b0 + b1X)):

when b0 + b1X = 0, p will be 0.5;

similarly, when b0 + b1X > 0, p goes towards 1, and when b0 + b1X < 0, p goes towards 0.
Logistic Regression
Interpretation of the Coefficients
• The interpretation of the weights differs from linear regression, as the output of logistic regression consists of probabilities between 0 and 1.
• Instead of the slope coefficient (b) being the rate of change of p as x changes, the slope coefficient is now interpreted as the rate of change of the "log odds" as X changes.
Logistic Regression

HPenguin wants to know how likely it is to be happy based on its daily activities.

No. | Penguin Activity | Activity Description | How Penguin Felt (Target)
----|------------------|----------------------|--------------------------
1   | X1               | Eating squids        | Happy
2   | X2               | Eating small fishes  | Happy
3   | X3               | Hit by other penguin | Sad
4   | X4               | Eating crabs         | Sad
Logistic Regression
HPenguin wants to know how likely it is to be happy based on its daily activities.

No. | Penguin Activity | Activity Score | Weights | Target | How Penguin Felt (Target)
----|------------------|----------------|---------|--------|--------------------------
1   | X1               | 6              |  0.6    | 1      | Happy
2   | X2               | 3              |  0.4    | 1      | Happy
3   | X3               | 7              | -0.7    | 0      | Sad
4   | X4               | 3              | -0.3    | 0      | Sad

Activity Score:
The activity score is the numerical equivalent of the penguin activity.
Weights:
• The weights are like weightages corresponding to the particular target.
• If the penguin performs activity X1, the model is 60% confident that the penguin will be happy.
• Observe that the weights for the target class "happy" are positive, and the weights for the target class "sad" are negative.
Logistic Regression
• To predict how the penguin will feel given the activity:

 Multiply the activity score and the corresponding weight to get the score.
 The calculated score is also known as the logit.
 The logit (score) is passed into the softmax function to get the probability for each target class.
 Passing the logit through the softmax function gives the probability for the target "happy" class and for the target "sad" class.
 Take the target class with the highest probability as the predicted target class for the given activity.
 If the logit is greater than 0 the target class is happy, and if the logit is less than 0 the target class is sad.
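A tiny sketch of this score-to-probability step; using a two-class softmax over the logits [z, 0] is one conventional setup and is an assumption here, since the slide does not spell out the exact form:

import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

activity_score, weight = 6, 0.6   # activity X1 from the table above
logit = activity_score * weight   # score (logit) = 3.6

# Assumed two-class setup: compare the logit against a zero baseline
p_happy, p_sad = softmax(np.array([logit, 0.0]))
print(f"P(happy)={p_happy:.3f} P(sad)={p_sad:.3f}")  # logit > 0, so "happy" wins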
Logistic Regression
Binary classification with the logistic regression model:
1. The weights are calculated over the training data set.
2. Using the calculated weights, the logits are computed.
3. The calculated logits (scores) are passed through the softmax function.
4. The softmax function returns the probabilities for each target class.
5. The target class with the highest probability is the predicted target class.

1. Consider a model with features x1, x2, x3 ... xn. Let the binary output be denoted by Y, which can take the values 0 or 1. Let p be the probability of Y = 1; we can denote it as p = P(Y=1).
The mathematical relationship between these variables can be denoted as

ln(p / (1 − p)) = b0 + b1x1 + b2x2 + ... + bnxn

Here the term p/(1−p) is known as the odds and denotes the likelihood of the event taking place. Thus ln(p/(1−p)) is known as the log odds and is simply used to map the probability that lies between 0 and 1 to a range between (−∞, +∞). The terms b0, b1, b2, ... are parameters (or weights) that we will estimate during training.
Logistic Regression
Binary classification with the logistic regression model (continued):
1. The log term ln on the LHS can be removed by raising the RHS to a power of e:

p / (1 − p) = e^(b0 + b1x1 + ... + bnxn)

2. Now we can easily simplify to obtain the value of p:

p = e^(b0 + b1x1 + ... + bnxn) / (1 + e^(b0 + b1x1 + ... + bnxn)) = 1 / (1 + e^-(b0 + b1x1 + ... + bnxn))

3. The sigmoid function is given by:

σ(z) = 1 / (1 + e^-z)
Cost function for logistic regression
• For linear regression, the cost function is usually the mean squared error: the difference between y_predicted and y_actual for each data point, squared and then averaged over all data points.

The cost function for logistic regression is the log loss (cross-entropy), shown below.
Logistic Regression
Assumptions of Logistic Regression

• The dependent variable should be categorical (binary, ordinal, nominal or count occurrences).
• The predictor or independent variables should be continuous or categorical.
• The correlation among the predictors or independent variables (multi-collinearity) should not be severe, but there should exist linearity between the independent variables and the log odds.
• The data should be a representative part of the population, and the data should be recorded in the order it is collected.
• The model should provide a good fit to the data.
Logistic Regression
Applications of Logistic Regression
• This is a very useful technique in the field of marketing, for predicting whether the company will make a profit, make a loss, or remain at break-even based on its operations.
• It can be used by a company to predict the attendance of its employees by studying the pattern in which they take leave, and also according to their individual characteristics.
• It can turn out to be a useful technique for medical purposes. It can predict the medical condition of a patient based on his/her medical history, symptoms and individual characteristics, also comparing him/her with other patients.
• Because of its efficient and straightforward nature, it is easy to implement and therefore widely used by data analysts and scientists.
Thank You
