Unit II
• Course Objectives:
1. To understand the basic concepts of machine learning and apply them to various
problems.
2. To learn the various types of machine learning and use them for different machine learning tasks.
3. To optimize the machine learning model and generalize it.
• Course Outcomes:
On completion of the course, students will be able to–
CO1: Apply basic concepts of machine learning and different types of machine learning
algorithms.
CO2 : Compare different types of classification models and their relevant application
CO3 : Differentiate various regression techniques and evaluate their performance.
CO4: Illustrate the tree-based and probabilistic machine learning algorithms
CO5: Identify different unsupervised learning algorithms for the related real world problems
CO6: Apply fundamental concepts of ANN.
Unit II CLASSIFICATION
SYLLABUS
Linear Models: Introduction, Linear Support Vector Machines (SVM) – Introduction, Soft
Margin SVM, Introduction to various SVM Kernels to handle non-linear data – RBF,
Gaussian, Polynomial, Sigmoid.
• Classification algorithms learn patterns from the training data (similar number
sequences, words or sentiments, and the like) and look for the same patterns in future data sets.
• Sentiment analysis – used for categorizing unstructured text by opinion polarity (positive,
negative or neutral).
Binary Classification: Linear Classification Model
Classification Terminologies In Machine Learning
• Classifier – An algorithm that maps the input data to a specific category.
• Classification Model – The model draws a conclusion from the input data given for
training; it then predicts the class or category for new data.
• Binary Classification – Classification with two possible outcomes, e.g. either true
or false.
• Multi-Class Classification – Classification with more than two classes; in multi-class
classification each sample is assigned to one and only one label or target.
• Train the Classifier – Each classifier in scikit-learn uses the fit(X, y) method to fit the
model to the training data X and the training labels y.
• Predict the Target – For an unlabeled observation X, the predict(X) method returns the predicted
label y.
• Evaluate – Evaluation of the model, e.g. classification report,
accuracy score, etc.
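The train / predict / evaluate cycle described above can be sketched without any library. The class `NearestCentroidSketch`, the helper `accuracy_score` and the data are hypothetical stand-ins written for this illustration; they only mimic the fit(X, y) / predict(X) convention and are not scikit-learn code.

```python
from collections import defaultdict


class NearestCentroidSketch:
    """Toy classifier (hypothetical) that follows the fit(X, y) / predict(X) convention."""

    def fit(self, X, y):
        # Group the training rows by label and average them feature by feature.
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        # Assign each row to the label of the closest centroid (squared distance).
        def sq_dist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

        return [
            min(self.centroids_, key=lambda label: sq_dist(row, self.centroids_[label]))
            for row in X
        ]


def accuracy_score(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data: two features per email, label 1 = spam, 0 = ham.
X_train = [[8, 1], [9, 2], [7, 1], [1, 7], [2, 9], [1, 8]]
y_train = [1, 1, 1, 0, 0, 0]
X_test = [[8, 2], [2, 8]]
y_test = [1, 0]

clf = NearestCentroidSketch().fit(X_train, y_train)  # train the classifier
y_pred = clf.predict(X_test)                         # predict the target
print(y_pred, accuracy_score(y_test, y_pred))        # evaluate
```

With a real scikit-learn classifier the same three calls (fit, predict, then a metric) appear in the same order.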
Binary Classification: Linear Classification Model
Linear separability: A dataset is linearly separable if there is at least one line
(in higher dimensions, a hyperplane) that separates the classes without error.
• Examples
Email Spam Detector
Handwritten character recognition: classify a character as one of the known characters.
Patient diagnosis: the patient has a disease or not.
Binary Classification: Linear Classification Model
• In a classification algorithm, the input variable (x) is mapped to a discrete
output (y).
Performance Evaluation: Confusion Matrix
• In any binary classification task there are only two classes, so each prediction
of the model has only two possible results: it is either correct or incorrect.
Few Definitions
• The objects of interest in machine learning are usually referred to as instances.
• The set of all possible instances is called the instance space, denoted X.
• Labels are drawn from the label space L, and model outputs from the output space Y.
• Model: a mapping from the instance space to the output space.
• In classification the output space is a set of classes, while in regression it is the
set of real numbers.
• In order to learn such a model we require a training set Tr of labelled instances
(x,l(x)), also called examples, where l : X → L is a labelling function.
• Some of the labelled data is usually set aside for evaluating or testing a
classifier, in which case it is called a test set and denoted by Te. We use
superscripts to restrict training or test set to a particular class:
e.g., Te⊕ = {(x,l(x))|x ∈ Te, l(x) = ⊕}
is the set of positive test examples, and Te⊖ is the set of negative test
examples.
\[ L = \begin{cases} 1 & \text{if } x \text{ is a positive example} \\ 0 & \text{if } x \text{ is a negative example} \end{cases} \]
Each car is represented by such an ordered pair (x, r), where r denotes the class label
(the value L above), and the training set contains N such examples.
Classification
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
\[ h(x) = \begin{cases} 1 & \text{if } h \text{ classifies } x \text{ as a positive example} \\ 0 & \text{if } h \text{ classifies } x \text{ as a negative example} \end{cases} \]
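A minimal sketch of the rectangle hypothesis above, assuming each car x is a (price, engine power) pair. The function name `make_hypothesis`, the bounds and the three cars are hypothetical values chosen only for illustration.

```python
def make_hypothesis(p1, p2, e1, e2):
    """Return h(x) for the rectangle (p1 <= price <= p2) AND (e1 <= engine power <= e2)."""
    def h(x):
        price, engine_power = x
        return 1 if (p1 <= price <= p2) and (e1 <= engine_power <= e2) else 0
    return h


# Hypothetical bounds and labelled cars ((price, engine power), label).
h = make_hypothesis(p1=15000, p2=30000, e1=100, e2=200)
cars = [((20000, 150), 1), ((50000, 300), 0), ((18000, 90), 1)]

# Count the examples on which the hypothesis disagrees with the label.
errors = sum(h(x) != label for x, label in cars)
print("misclassified examples:", errors)
```

The third car lies just outside the rectangle on engine power although it is labelled positive, so this particular choice of bounds makes one error on the sample.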
True Positive:
Interpretation: You predicted positive and it’s true.
You predicted that a woman is pregnant and she actually is.
True Negative:
Interpretation: You predicted negative and it’s true.
You predicted that a man is not pregnant and he actually is not.
False Positive: (Type 1 Error)
Interpretation: You predicted positive and it’s false.
You predicted that a man is pregnant but he actually is not.
False Negative: (Type 2 Error)
Interpretation: You predicted negative and it’s false.
You predicted that a woman is not pregnant but she actually is.
Performance Evaluation: Confusion Matrix
Confusion Matrix
1. True Positive Rate (TPR)
(Sensitivity or Recall)
TPR = TP / (TP + FN)
2. True Negative Rate (TNR)
(Specificity)
TNR = TN / (TN + FP)
3. False Positive Rate (FPR)
FPR = FP / (TN + FP)
4. False Negative Rate (FNR)
FNR = FN / (TP + FN)
5. Positive Predictive Value (PPV), i.e. Precision
PPV = TP / (TP + FP)
6. Negative Predictive Value (NPV)
NPV = TN / (TN + FN)
7. False Discovery Rate (FDR)
FDR = FP / (FP + TP)
8. False Omission Rate (FOR)
FOR = FN / (FN + TN)
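The eight ratios above can be computed directly from the four cells of the confusion matrix. This is a small sketch; the function name `confusion_metrics` and the example counts are made up for illustration.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return the eight rates listed above as a dictionary."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity / recall
        "TNR": tn / (tn + fp),  # specificity
        "FPR": fp / (tn + fp),
        "FNR": fn / (tp + fn),
        "PPV": tp / (tp + fp),  # precision
        "NPV": tn / (tn + fn),
        "FDR": fp / (fp + tp),
        "FOR": fn / (fn + tn),
    }


# Hypothetical counts, not taken from the exercises in these slides.
m = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name} = {value:.3f}")
```

Several pairs are complements (TPR + FNR = 1, TNR + FPR = 1, PPV + FDR = 1, NPV + FOR = 1), which is a quick way to check a hand calculation.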
Performance Evaluation: Confusion Matrix
Confusion Matrix
Precision is the ratio of True Positives to all the positives predicted by the model.
Low precision: the more False Positives the model predicts, the lower the precision.
Recall (Sensitivity) is the ratio of True Positives to all the positives in your dataset.
Low recall: the more False Negatives the model predicts, the lower the recall.
Performance Evaluation: Confusion Matrix
Recall versus Precision
In case 1, which scenario do you think will have the highest cost?
If we predict COVID-19 patients as healthy and tell them they do not need to
quarantine, there could be a massive number of new COVID-19 infections. The cost of false
negatives is much higher than the cost of false positives.
Performance Evaluation: Confusion Matrix
Recall versus Precision
In case 2, which scenario do you think will have the highest cost?
Missing important emails is clearly more of a problem than receiving spam, so in this
case FP (a legitimate email marked as spam) has a higher cost than FN.
Performance Evaluation: Confusion Matrix
Recall versus Precision
In case 3, which scenario do you think will have the highest cost?
Taking "bad loan" as the positive class: the bank would lose a large amount of money if
actual bad loans are predicted as good loans, because those loans are not repaid. On the
other hand, the bank only misses out on additional revenue if actual good loans are
predicted as bad loans. Therefore, the cost of False Negatives is much higher than the
cost of False Positives.
Performance Evaluation: Confusion Matrix
In practice, the cost of false negatives is usually not the same as the cost of false
positives; which one is higher depends on the specific case. Accuracy alone is therefore
not enough: we should also evaluate our model using other
metrics, for example, Recall and Precision.
Performance Evaluation: Confusion Matrix
Ex 1. A dataset contains 33 spam and 67 ham mails. After training, the classifier
correctly predicted 27 spam mails and 57 ham mails. Create the confusion matrix and
find all evaluation metrics (accuracy, recall, precision).
Ex 2
Total of 10 cats and dogs and our model predicts whether it is a cat or not.
Actual values = [‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘cat’, ‘dog’]
Predicted values = [‘dog’, ‘dog’, ‘dog’, ‘cat’, ‘dog’, ‘dog’, ‘cat’, ‘cat’, ‘cat’,
‘cat’]
Cat = Positive, Dog = Negative
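One way to work through this exercise is to count the four cells directly from the two lists. The sketch below uses the actual and predicted values given above, with cat as the positive class; the variable names are chosen here for illustration.

```python
actual = ["dog", "cat", "dog", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
predicted = ["dog", "dog", "dog", "cat", "dog", "dog", "cat", "cat", "cat", "cat"]
positive = "cat"

pairs = list(zip(actual, predicted))
tp = sum(a == positive and p == positive for a, p in pairs)  # cat predicted as cat
fp = sum(a != positive and p == positive for a, p in pairs)  # dog predicted as cat
fn = sum(a == positive and p != positive for a, p in pairs)  # cat predicted as dog
tn = sum(a != positive and p != positive for a, p in pairs)  # dog predicted as dog

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("TP, FP, FN, TN =", tp, fp, fn, tn)
print("accuracy, precision, recall =", accuracy, precision, recall)
```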
Ex 3. 165 patients were tested for the presence of a disease. Out of those
165 cases, the classifier predicted "yes" 110 times, and "no" 55 times. In reality,
105 patients in the sample have the disease, and 60 patients do not. Find the
confusion matrix and all evaluation metrics.
                      Actual Positive    Actual Negative
Predicted Positive    TP                 FP
Predicted Negative    FN                 TN
Performance Evaluation: F1-Score:
Out of the 300 fraudulent transactions, only 100 fraudulent transactions are classified
correctly. The classifier missed 200 out of the 300 fraudulent transactions!
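The slide title refers to the F1-score, but the formula did not survive in the text. As a reminder, using the standard definition (an addition here, not taken from the slide), F1 is the harmonic mean of precision P and recall R. From the numbers given, only the recall of the fraud classifier can be computed, because the number of false positives is not stated:

```latex
\[
F_1 = \frac{2\,P\,R}{P + R},
\qquad
R = \frac{TP}{TP + FN} = \frac{100}{100 + 200} \approx 0.33
\]
```

Because the harmonic mean is pulled towards the smaller of the two values, a recall this low keeps F1 low however high the precision is.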
The ROC curve is created by plotting the True Positive Rate (TPR) against the False Positive
Rate (FPR) at various threshold settings, so both rates are calculated at each threshold.
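A minimal sketch of how the ROC points are obtained: for each threshold, scores at or above it are predicted positive, and one (FPR, TPR) pair is computed. The function name `roc_points`, the labels, scores and thresholds are all made up for illustration.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold."""
    points = []
    for t in thresholds:
        y_pred = [1 if s >= t else 0 for s in scores]
        tp = sum(y == 1 and p == 1 for y, p in zip(y_true, y_pred))
        fp = sum(y == 0 and p == 1 for y, p in zip(y_true, y_pred))
        fn = sum(y == 1 and p == 0 for y, p in zip(y_true, y_pred))
        tn = sum(y == 0 and p == 0 for y, p in zip(y_true, y_pred))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Hypothetical true labels and classifier scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [2.0, 5.0, 6.0, 8.0, 9.0, 11.0]
thresholds = [12.0, 9.0, 5.5, 0.0]

points = roc_points(y_true, scores, thresholds)
for t, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold={t}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Lowering the threshold moves the point from (0, 0) towards (1, 1); joining the points gives the ROC curve.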
Performance Evaluation: ROC Curve
• The confusion matrix for the Threshold=9.0 case:
With n = 3 classes, n*(n-1)/2 = 3*(3-1)/2 = 3 binary classifiers have to be
generated (one per pair of classes).
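The count n*(n-1)/2 is the number of distinct class pairs, which can be checked by enumerating them. The class names C1, C2, C3 below are placeholders.

```python
from itertools import combinations

classes = ["C1", "C2", "C3"]

# One binary classifier is needed for every unordered pair of classes.
pairs = list(combinations(classes, 2))
n = len(classes)

print(pairs)
print(len(pairs), "==", n * (n - 1) // 2)
```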
Multi-class Classification:
Multiclass Classification Strategies:
OneVsRest:
Multi-class Classification
Same setup, where we have a set of features for each example, but the label now takes
more than two values. Example labels: apple, orange, banana, banana, pineapple.
One-vs-rest trains one binary classifier per class, e.g. apple vs. not.
How do we classify?
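One common answer to "How do we classify?" in a one-vs-rest setup is to score the example with every "class vs. not" classifier and pick the class with the highest score. The sketch below assumes linear scorers; the weights, biases and the helper names `score` and `predict_one_vs_rest` are hypothetical and are not taken from the slides.

```python
# Hypothetical linear scorers, one per class: score_c(x) = w_c . x + b_c
scorers = {
    "apple": ([1.0, 0.0], 0.0),
    "orange": ([0.0, 1.0], 0.0),
    "banana": ([-1.0, -1.0], 1.0),
}


def score(label, x):
    w, b = scorers[label]
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_one_vs_rest(x):
    # The class whose "class vs. not" scorer is most confident wins.
    return max(scorers, key=lambda label: score(label, x))


print(predict_one_vs_rest([2.0, 0.5]))
print(predict_one_vs_rest([0.0, 0.0]))
```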
Multi-class Classification
Predicted Class    C1    C2    C3
C1                 15     7     2
C2                  2    15     3
C3                  3     8    45
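The same ideas carry over to the 3-class matrix above. The sketch below ASSUMES that rows are predicted classes and columns are actual classes; the extracted table does not make the orientation certain, and if it is the other way round the per-class precision and recall simply swap. Accuracy does not depend on the orientation.

```python
labels = ["C1", "C2", "C3"]
# Assumption: matrix[i][j] = number of examples predicted as class i whose actual class is j.
matrix = [
    [15, 7, 2],
    [2, 15, 3],
    [3, 8, 45],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

row_sums = [sum(row) for row in matrix]        # predicted counts per class
col_sums = [sum(col) for col in zip(*matrix)]  # actual counts per class
precision = [matrix[i][i] / row_sums[i] for i in range(len(labels))]
recall = [matrix[i][i] / col_sums[i] for i in range(len(labels))]

print("accuracy =", accuracy)
for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
```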
Basic principles of classification
•All objects before the coast line are boats and all objects after the coast line
are houses.
•Coast line serves as a decision surface that separates two classes.
Basic principles of classification
Unseen (new) objects are classified as “boats” if they fall below the
decision surface and as “houses” if they fall above it.
Basic principles of classification
Find a linear decision surface (“hyperplane”) that can separate patient classes and has the
largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support
vectors”);
SVM
1. Linear SVM – Hard Margin Classifier
Used for a perfectly separable dataset (linear classification); called “Linear SVM – Hard Margin
Classifier”.
[Figure: Margin; Support Vectors xi; distances d]
SVM
Separating Hyperplanes
Which one should we choose?
[Figure: several candidate hyperplanes separating the two classes, yi = +1 and yi = −1]
That is why the objective of the SVM is to find the optimal separating hyperplane which
maximizes the margin of the training data.
Support Vector Machine: Mathematics Behind SVM
• Basics of vectors:
• Definition: A vector is an object that has both a magnitude and a direction.
1) The magnitude
• The magnitude or length of a vector x is written ||x|| and is called its norm.
• e.g. for the vector OA, ||OA|| is the length of the segment OA.
2) The direction
• For u(3,4), ||u|| = 5, so cos(θ) = u1/||u|| = 3/5 = 0.6 and cos(α) = u2/||u|| = 4/5 = 0.8.
• The direction of u(3,4) is the vector w(0.6,0.8).
• If u is the direction of a vector y, then u = y / ||y||.
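The norm and direction of u(3,4) can be checked in a few lines. This is a plain-Python sketch; the variable names are chosen here for illustration.

```python
import math

u = (3.0, 4.0)

# Magnitude (norm): ||u|| = sqrt(3^2 + 4^2)
norm_u = math.hypot(*u)

# Direction: divide every component by the norm.
direction = tuple(component / norm_u for component in u)

print("||u|| =", norm_u)
print("direction =", direction)
print("norm of direction =", math.hypot(*direction))
```

The direction vector always has norm 1, which is why it is also called a unit vector.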
Compute the margin of the hyperplane
A hyperplane separates the two groups of data. Compute the distance between the point
A(3,4) and the hyperplane.
Treat the point A as a vector a from the origin to A.
Take the projection of a onto the unit vector in the direction of w.
1. w = (2,1) is normal to the hyperplane, and a = (3,4) is the vector between
the origin and A.
||w|| = √(2² + 1²) = √5
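The projection step can be sketched as follows, ASSUMING the hyperplane passes through the origin (w · x = 0), so that the length of the projection of a onto the unit normal is the distance from A to the hyperplane. The variable names are chosen here for illustration.

```python
import math

w = (2.0, 1.0)  # normal to the hyperplane
a = (3.0, 4.0)  # vector from the origin to the point A

norm_w = math.hypot(*w)              # ||w|| = sqrt(5)
unit_w = tuple(c / norm_w for c in w)

# Scalar projection of a onto the unit normal: a . (w / ||w||)
scalar_proj = sum(ai * ui for ai, ui in zip(a, unit_w))

# Projection vector p = (a . unit_w) * unit_w
p = tuple(scalar_proj * ui for ui in unit_w)
distance = math.hypot(*p)

print("p =", p)
print("distance from A to the hyperplane =", distance)
```

Here p comes out as (4, 2), so the distance is ||p|| = √20 = 2√5.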
Maximize the distance between the two hyperplanes
How to find the distance between the two hyperplanes
SVM
We have found the couple (w,b) for which ||w|| is the smallest possible while the constraints we
fixed are met, which means we have the equation of the optimal hyperplane!
We find w and b by solving the following optimization problem using Quadratic Programming.
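The optimization problem itself did not survive in the slide text. In its standard hard-margin form (a reconstruction under the usual conventions, with labels y_i ∈ {−1, +1} and N training points), the two margin hyperplanes w^T x + b = ±1 are 2/||w|| apart, so maximizing that distance is the same as minimizing ||w||:

```latex
\[
\text{margin} = \frac{2}{\lVert w \rVert},
\qquad
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{T}x_i + b) \ge 1, \quad i = 1,\dots,N
\]
```

The objective is quadratic in w and the constraints are linear, which is why a Quadratic Programming solver applies.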
Linear SVM : Soft Margin Classifier
• An ideal SVM analysis should produce a hyperplane that completely separates
the vectors (cases) into two non-overlapping classes.
• However, perfect separation may not be possible, or it may result in a model
fitted to so many individual cases that it does not classify new data correctly.
• In this situation SVM finds the hyperplane that maximizes the margin and
minimizes the misclassifications.
The Hard Margin Classifier won't work here: when the classes overlap, the inequality
constraint yi(wTxi + b) ≥ 1 cannot be satisfied for every point.
The soft margin classifier relaxes the constraint with slack variables. The algorithm
tries to keep the slack variables at zero while maximizing the margin.
However, it does not minimize the number of misclassifications (an NP-complete
problem) but the sum of distances from the margin hyperplanes.
Linear SVM : Soft Margin Classifier
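The slack variable ξi indicates how much the point xi can violate the margin.
The slack variable helps to define 3 types of data points:
• ξi = 0: the point is correctly classified and lies on or outside the margin.
• 0 < ξi < 1: the point is correctly classified but lies inside the margin.
• ξi > 1: the point is misclassified.
The hyperparameter C weights the total slack (the loss) against the margin:
• If C→0, the loss term has no effect and we are only trying to maximize the margin.
• If C→∞, the margin does not have any effect and the objective function tries
to just minimize the loss.
• In other words, the hyperparameter C controls the relative weighting between
the twin goals of making the margin large and ensuring that most examples have
functional margin at least 1.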
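The role of the slack variables and of C can be sketched numerically. The code assumes the standard soft-margin objective ½||w||² + C·Σξi with ξi = max(0, 1 − yi(w·xi + b)), which is not written out in the slide text; the weights, the points and the helper names are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def slack(w, b, x, y):
    # xi = max(0, 1 - y (w . x + b)): zero when the margin constraint holds.
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def point_type(xi):
    if xi == 0.0:
        return "on or outside the margin"
    if xi < 1.0:
        return "inside the margin, correctly classified"
    return "misclassified"


def soft_margin_objective(w, b, X, Y, C):
    # Margin term plus C times the total slack.
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, y) for x, y in zip(X, Y))


# Hypothetical hyperplane and labelled points (labels are +1 / -1).
w, b = (1.0, 0.0), 0.0
X = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (-2.0, 0.0)]
Y = [1, 1, 1, -1]

slacks = [slack(w, b, x, y) for x, y in zip(X, Y)]
for x, xi in zip(X, slacks):
    print(x, xi, point_type(xi))

print(soft_margin_objective(w, b, X, Y, C=1.0))
print(soft_margin_objective(w, b, X, Y, C=10.0))
```

With the same w and b, raising C from 1 to 10 multiplies the penalty for the two violating points by ten, so an optimizer would give up margin width to reduce the slack.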
Non-Linear SVM
• When the classes cannot be separated by a linear boundary, apply another trick
called the kernel trick, which helps handle such data.
Non-linear SVMs
Φ: x → φ(x)
SVM
• If such a linear decision surface does not exist, the data is mapped into a much higher
dimensional space (“feature space”) where the separating decision surface is found;
• The feature space is constructed via a very clever mathematical projection (“kernel trick”).
The “Kernel Trick”
• Kernel trick: The kernel function implicitly maps the data into a higher dimensional
feature space, by computing inner products there directly, to make linear separation possible.
• Linear: The linear kernel does not transform the data at all. Therefore, it can be expressed
simply as the dot product of the features: K(xi, xj) = xi · xj
• Gaussian (radial-basis function): K(xi, xj) = exp(−γ ||xi − xj||²) with γ > 0. The RBF
kernel performs well on many types of data and is thought to be a reasonable starting point
for many learning tasks.
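The four kernels named in the syllabus can be written as small functions of two vectors. This sketch uses their usual textbook forms; the parameter names `gamma`, `coef0` and `degree` and their default values are assumptions made for the example, not values from the slides.

```python
import math


def dot(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))


def linear_kernel(x, z):
    # Plain dot product: no transformation of the data.
    return dot(x, z)


def polynomial_kernel(x, z, degree=2, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    # Gaussian / radial-basis function: depends only on the squared distance.
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, z, gamma=0.1, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)


x, z = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, z))
print(polynomial_kernel(x, z))
print(rbf_kernel(x, z))
print(sigmoid_kernel(x, z))
```

Each function returns a similarity value for a pair of points; the SVM only ever needs these pairwise values, never the coordinates in the higher dimensional feature space.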
Logistic Regression
Gender: In this example, males are assumed to have a higher chance of purchasing a MacBook than females.
• The logistic regression model passes the likelihood of occurrence through
the logistic function to predict the corresponding target class.
• For binary classification this logistic function is the sigmoid function; the softmax
function is its generalization to more than two classes.
The shop owner will use the above and similar kinds of features to predict the likelihood
of occurrence of the event (what is the event here?)
Logistic Regression
Ex. X axis: age of the person. Y axis: person has a smartphone or not.
This is a classification problem where, given the age of a person, we have to predict whether he
possesses a smartphone or not.
• All the data points below the threshold will be classified as 0, i.e. those who do not
have smartphones.
• Similarly, all the observations above the threshold will be classified as 1, which
means these people have smartphones.
Case 1: When a new data point is added on the extreme right of the plot, the slope
of the fitted line changes, and we are forced to change the threshold of our model.
Case 2: If we extend this line it will give values above 1 and below 0. In our
classification problem, we do not know what values greater than 1 or below
0 represent, so this is not a natural extension of the linear model.
Why not Linear Regression for Classification?
Penguin wants to know how likely it is to be happy based on its daily activities.
No. | Penguin Activity | Activity Score | Weight | Target | How Penguin Felt
1   | X1               | 6              | 0.6    | 1      | Happy
2   | X2               | 3              | 0.4    | 1      | Happy
3   | X3               | 7              | -0.7   | 0      | Sad
4   | X4               | 3              | -0.3   | 0      | Sad
Activity Score:
The activity score is the numerical equivalent of the penguin activity.
Weights:
• The weights are the weightages corresponding to the particular target.
• That is, if the penguin performs the activity X1, the model is 60% confident
that the penguin will be happy.
• Observe that the weights for the target class happy are positive, and the weights for
the target class sad are negative.
Logistic Regression
• To predict how the penguin will feel given the activity:
Multiply the activity score by the corresponding weight to get the score.
The calculated score is also known as the logit.
The logit (score) is passed through the softmax function to get the probability for
each target class (happy and sad).
The target class with the higher probability is the predicted target class for the
given activity.
If the logit is greater than 0 the target class is happy, and if the logit is less
than 0 the target class is sad.
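The prediction steps above can be sketched on the four penguin rows. The code ASSUMES the numeric columns of the penguin table are read as (activity score, weight, target); for a single logit the happy-class probability is the sigmoid, which equals a two-class softmax over the logits [logit, 0].

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# (activity, activity score, weight, target) as read from the penguin table.
rows = [("X1", 6, 0.6, 1), ("X2", 3, 0.4, 1), ("X3", 7, -0.7, 0), ("X4", 3, -0.3, 0)]

predictions = []
for name, activity_score, weight, target in rows:
    logit = activity_score * weight  # score = activity score * weight
    p_happy = sigmoid(logit)
    # Two-class softmax over [logit, 0] gives the same probability.
    p_softmax = math.exp(logit) / (math.exp(logit) + math.exp(0.0))
    assert abs(p_happy - p_softmax) < 1e-12
    predicted = 1 if p_happy > 0.5 else 0  # same rule as "logit > 0"
    predictions.append(predicted)
    print(f"{name}: logit={logit:.1f}, P(happy)={p_happy:.3f}, predicted={predicted}, target={target}")
```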
Logistic Regression
Binary classification with Logistic Regression model
1. The weights are calculated over the training data set.
2. Using the calculated weights, the logits are computed.
3. The calculated logits (scores) are passed through the softmax function.
4. The softmax function returns the probabilities for each target class.
5. The target class with the highest probability is the predicted target class.
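Step 1 (calculating the weights over the training data) is not detailed in the slides. One possible way to do it, sketched here as an assumption, is batch gradient descent on the log-loss with a single feature. The function name `fit_logistic`, the learning rate, the number of steps and the age data (in decades) are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=10000):
    """Estimate weight w and bias b by batch gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: age in decades, 1 = has a smartphone, 0 = does not.
ages = [1.5, 1.8, 2.0, 2.2, 4.0, 4.5, 5.0, 6.0]
has_phone = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(ages, has_phone)              # step 1: weights
logits = [w * x + b for x in ages]                # step 2: logits
probs = [sigmoid(z) for z in logits]              # steps 3-4: probabilities
preds = [1 if p > 0.5 else 0 for p in probs]      # step 5: predicted class
print(w, b, preds)
```

Library implementations typically use more robust solvers and regularization; the loop above is only meant to make the five steps concrete.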
1. Consider a model with features x1, x2, x3 … xn. Let the binary output be denoted by Y,
which can take the values 0 or 1. Let p be the probability of Y = 1; we can denote it as p =
P(Y=1).
The mathematical relationship between these variables can be denoted as
ln(p/(1−p)) = b0 + b1x1 + b2x2 + … + bnxn
Here the term p/(1−p) is known as the odds and denotes the likelihood of the event taking
place. Thus ln(p/(1−p)) is known as the log odds and is simply used to map the probability
that lies between 0 and 1 to the range (−∞,+∞). The terms b0, b1, b2… are parameters
(or weights) that we will estimate during training.
Logistic Regression
Binary classification with Logistic Regression model
1. The log term ln on the LHS can be removed by raising the RHS as a power of e:
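The equation for this step is missing from the slide text. Starting from the log-odds relationship ln(p/(1−p)) = b0 + b1x1 + … + bnxn, the usual derivation (reconstructed here in the same symbols) exponentiates both sides and then solves for p:

```latex
\[
\frac{p}{1-p} = e^{\,b_0 + b_1 x_1 + \dots + b_n x_n}
\quad\Longrightarrow\quad
p = \frac{e^{\,b_0 + b_1 x_1 + \dots + b_n x_n}}{1 + e^{\,b_0 + b_1 x_1 + \dots + b_n x_n}}
  = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \dots + b_n x_n)}}
\]
```

The last expression is the sigmoid (logistic) function applied to the linear score, so p always lies between 0 and 1.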