ML Classification Trupesh Patel

Contents

❖ Logistic Regression
❖ Support Vector Machine
❖ K- Nearest neighbour (KNN)
Logistic regression
Introduction
❖ Logistic Regression is commonly used to estimate the probability that
an instance belongs to a particular class (e.g., what is the probability
that this email is spam?).
❖ If the estimated probability is greater than 50%, then the model predicts
that the instance belongs to that class (called the positive class, labeled
“1”), or else it predicts that it does not (i.e., it belongs to the negative
class, labeled “0”). This makes it a binary classifier.
Estimating Probabilities :
❖ Logistic Regression model computes a weighted sum of the input features
(plus a bias term), but instead of outputting the result directly like the Linear
Regression model does, it outputs the logistic of this result.

❖ The logistic—noted σ(·)—is a sigmoid function (i.e., S-shaped) that outputs a
number between 0 and 1. (Its inverse, the log-odds, is called the logit.)
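In equation form, the probability estimate and the logistic function are:

```latex
\hat{p} = h_{\theta}(\mathbf{x}) = \sigma\left(\theta^{T}\mathbf{x}\right),
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}}
```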
Logistic Function
❖ Once the Logistic Regression model has estimated the probability p = hθ (x)
that an instance x belongs to the positive class, it can make its prediction ŷ
easily.

❖ Logistic Regression model prediction: ŷ = 0 if p̂ < 0.5, and ŷ = 1 if p̂ ≥ 0.5.


Training and Cost function
❖ The objective of training is to set the parameter vector θ so that the
model estimates high probabilities for positive instances (y = 1) and low
probabilities for negative instances (y = 0). This idea is captured by the
cost function for a single training instance x.

❖ Cost function of a single training instance
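Written out in its standard form, the cost of a single training instance is:

```latex
c(\theta) =
\begin{cases}
-\log(\hat{p}) & \text{if } y = 1,\\
-\log(1 - \hat{p}) & \text{if } y = 0.
\end{cases}
```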


Training and Cost function
❖ This cost function makes sense because – log(t) grows very large when t
approaches 0, so the cost will be large if the model estimates a
probability close to 0 for a positive instance, and it will also be very large
if the model estimates a probability close to 1 for a negative instance.
❖ On the other hand, – log(t) is close to 0 when t is close to 1, so the cost
will be close to 0 if the estimated probability is close to 0 for a negative
instance or close to 1 for a positive instance, which is precisely what we
want. The cost function over the whole training set is simply the average
cost over all training instances.
❖ It can be written in a single expression (as you can verify easily), called
the log loss.
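In a single expression, the log loss over m training instances is:

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}
\left[\, y^{(i)} \log\left(\hat{p}^{(i)}\right)
+ \left(1 - y^{(i)}\right)\log\left(1 - \hat{p}^{(i)}\right) \right]
```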
Training and Cost function
❖ The bad news is that there is no known closed-form equation to compute
the value of θ that minimizes this cost function (there is no equivalent of
the Normal Equation).
❖ But the good news is that this cost function is convex, so Gradient
Descent (or any other optimization algorithm) is guaranteed to find the
global minimum (if the learning rate is not too large and you wait long
enough).
❖ The partial derivative of the cost function with regard to the jth model
parameter θj is given by
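```latex
\frac{\partial}{\partial \theta_j} J(\theta)
= \frac{1}{m}\sum_{i=1}^{m}
\left( \sigma\left(\theta^{T}\mathbf{x}^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
```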
Training and Cost function
❖ For each instance, this computes the prediction error and multiplies it by the
jth feature value, and then it computes the average over all training instances.
❖ Once you have the gradient vector containing all the partial derivatives, you
can use it in the Batch Gradient Descent algorithm. That’s it: you now know how
to train a Logistic Regression model.
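As a rough illustration (not from the slides), a minimal NumPy sketch of this training loop might look as follows; `X_b` is assumed to already contain a leading bias column of ones:

```python
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# X_b: (m, n) feature matrix with a bias column of ones (assumed prepared upstream)
# y:   (m,) vector of 0/1 labels
def train_logistic_regression(X_b, y, eta=0.1, n_iterations=1000):
    m, n = X_b.shape
    theta = np.zeros(n)                              # start from all-zero parameters
    for _ in range(n_iterations):
        p_hat = sigmoid(X_b @ theta)                 # estimated probability for every instance
        gradients = (1 / m) * X_b.T @ (p_hat - y)    # average of (prediction error * feature value)
        theta -= eta * gradients                     # one Batch Gradient Descent step
    return theta
```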
Decision Boundaries :
❖ Let’s use the iris dataset to illustrate Logistic Regression. This is a
famous dataset that contains the sepal and petal length and width of
150 iris flowers of three different species: Iris-Setosa, Iris-Versicolor, and
Iris-Virginica.
Decision Boundaries :
❖ Let’s try to build a classifier to detect the Iris-Virginica type based only
on the petal width feature. First let’s load the data:
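A minimal sketch of this step with scikit-learn’s iris loader:

```python
from sklearn import datasets

iris = datasets.load_iris()
X = iris["data"][:, 3:]                      # petal width (cm), a single feature
y = (iris["target"] == 2).astype(int)        # 1 if Iris-Virginica, else 0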
Decision Boundaries :
❖ Now let’s train a Logistic Regression model:
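Continuing the sketch above:

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X, y)
```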
Decision Boundaries :
❖ Let’s look at the model’s estimated probabilities for flowers with petal
widths varying from 0 to 3 cm.
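A possible plotting sketch, reusing `log_reg` from the previous step:

```python
import numpy as np
import matplotlib.pyplot as plt

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)       # petal widths from 0 to 3 cm
y_proba = log_reg.predict_proba(X_new)               # column 1: P(Iris-Virginica)

plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris-Virginica")
plt.xlabel("Petal width (cm)")
plt.ylabel("Estimated probability")
plt.legend()
plt.show()
```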
Decision Boundaries :
❖ The petal width of Iris-Virginica flowers (represented by triangles)
ranges from 1.4 cm to 2.5 cm, while the other iris flowers (represented by
squares) generally have a smaller petal width, ranging from 0.1 cm to 1.8
cm.

Estimated probabilities and decision boundary


Decision Boundaries :
❖ There is a decision boundary at around 1.6 cm where both probabilities are
equal to 50%: if the petal width is higher than 1.6 cm, the classifier will
predict that the flower is an Iris-Virginica, or else it will predict that it is
not (even if it is not very confident), for example:
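Continuing the earlier sketch, checking two petal widths on either side of the boundary:

```python
log_reg.predict([[1.7], [1.5]])   # with a boundary near 1.6 cm this should give array([1, 0])
```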
Decision Boundaries :
❖ The figure displays two features: petal width and length.
❖ Once trained, the Logistic Regression classifier can estimate the probability
that a new flower is an Iris-Virginica based on these two features.
❖ The dashed line represents the points where the model estimates a 50%
probability: this is the model’s decision boundary. Note that it is a linear
boundary. Each parallel line represents the points where the model outputs a
specific probability, from 15% (bottom left) to 90% (top right). All the flowers
beyond the top-right line have an over 90% chance of being Iris-Virginica
according to the model.
Overfitting :

❖ Overfitting is a modeling error that occurs when a function or model fits the
training set too closely, resulting in a drastic drop in performance on the test set.
Examples :

❖ Suppose we need to predict whether a student will land a job interview based on
their resume, and we train a model on a dataset of 20,000 resumes and their outcomes.

❖ When we try the model on the original dataset, it predicts outcomes with 98%
accuracy. That sounds amazing, but it does not hold up in reality.

❖ Now comes the bad news: when we run the model on a new dataset of resumes, we
only get 50% accuracy.

❖ Our model does not generalize well from the training data to unseen data. This
is known as overfitting, and it is a common problem in data science.

❖ In fact, overfitting occurs in the real world all the time. We need to handle it
to generalize the model.
Find overfitting :

❖ The primary challenge in machine learning and data science is that we can’t
evaluate model performance until we test it. So the first step in finding
overfitting is to split the data into a training set and a test set.

❖ The accuracy observed on both sets can then be compared to conclude whether
overfitting is present. If the model performs much better on the training set
than on the test set, it is likely overfitting. For example, it would be a big
alert if our model saw 99% accuracy on the training set but only 50% accuracy on
the test set.
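A minimal sketch of this check with scikit-learn; the iris data and the Logistic Regression estimator here are illustrative stand-ins, not part of the slides:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)            # stand-in dataset; any features/labels work here
model = LogisticRegression(max_iter=1000)    # stand-in model; any classifier works here

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
# A much higher training accuracy than test accuracy (e.g. 0.99 vs 0.50) signals overfitting.
```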
Prevent overfitting :

❖ Training with more data

❖ Data augmentation

❖ Cross validation
❖ Feature selection
❖ Regularization
Regularization :

❖ Keep all the features, but reduce the magnitude/value of the parameters θj to
make their values smaller.

❖ Works well when we have a lot of features, each of which contributes a bit to
predicting y.
Regularization :

❖ Modify the cost function by adding an extra regularization term at the end to shrink every single
parameter (e.g. close to 0), as shown below:
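Under the common convention for regularized logistic regression (the slide’s own formula is not reproduced here), the modified cost is:

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}
\left[\, y^{(i)} \log\left(\hat{p}^{(i)}\right)
+ \left(1 - y^{(i)}\right)\log\left(1 - \hat{p}^{(i)}\right) \right]
+ \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}
```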

❖ lambda (the regularization parameter) controls the tradeoff between two goals:

❖ the original cost term (1st goal): fit the training data well

❖ the extra lambda term (2nd goal): keep the parameters small to avoid overfitting
❖ If all parameters (θ) are close to 0, the prediction will be close to 0 → the model becomes a nearly
flat straight line that fails to fit the features well → underfitting

❖ To sum up, if lambda is chosen to be too large, it may smooth out the function too much and
cause underfitting.
Support vector
machine
Support vector machine
❖ A Support Vector Machine (SVM) is a very powerful and versatile Machine
Learning model, capable of performing linear or nonlinear classification,
regression, and even outlier detection.

❖ SVMs are particularly well suited for classification of complex but small- or
medium-sized datasets.
❖ Applications: face detection, text and hypertext categorization,
classification of images, handwriting recognition.
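As an illustrative sketch (not from the slides), a linear SVM classifier for the iris data might look like this in scikit-learn; the feature-scaling step is an assumption of good practice rather than something stated here:

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                    # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)   # Iris-Virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),              # SVMs are sensitive to feature scales
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])                  # predicts whether this flower is Iris-Virginica
```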
Linear Classifiers
❖ A linear classifier has the form f(x, w, b) = sign(w . x - b), where +1 denotes
one class and -1 the other.
❖ Figure: the same labeled dataset drawn with several different candidate
separating lines. How would you classify this data? Any of these would be
fine.. but which is best?
Linear Classifiers
❖ Define the margin of a linear classifier as the width that the boundary could
be increased by before hitting a datapoint.
❖ The maximum margin linear classifier is the linear classifier with the, um,
maximum margin. This is the simplest kind of SVM (called an LSVM).
❖ Support Vectors are those datapoints that the margin pushes up against.
Why max margin?
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted
in its perpendicular direction), this gives us the least chance of causing a
misclassification.
3. LOOCV is easy since the model is immune to removal of any non-support-vector
datapoints.
4. There’s some theory (using VC dimension) that is related to (but not the same
as) the proposition that this is a good thing.
5. Empirically it works very very well.
Specifying a line and margin
• Figure: the classifier boundary (w . x + b = 0) with the Plus-Plane
(w . x + b = +1) and the Minus-Plane (w . x + b = -1) on either side, separating
the “Predict Class = +1” zone from the “Predict Class = -1” zone.
• How do we represent this mathematically, in m input dimensions?
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• Classify as.. +1 if w . x + b >= 1; -1 if w . x + b <= -1; Universe explodes
if -1 < w . x + b < 1
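A tiny sketch of this decision rule (purely illustrative; `w` and `b` are assumed to come from an already trained linear SVM):

```python
import numpy as np

def classify(x, w, b):
    score = np.dot(w, x) + b
    if score >= 1:
        return +1        # confidently on the plus side of the margin
    elif score <= -1:
        return -1        # confidently on the minus side of the margin
    else:
        return 0         # inside the margin (the slide's "Universe explodes" joke)
```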
K-nearest Neighbor(KNN)
The k-nearest neighbors classifier (kNN) is a non-parametric supervised
machine learning algorithm. It’s distance-based: it classifies objects based
on their proximate neighbors’ classes.

Non-parametric means that there is no fine-tuning of parameters in the
training step of the model. Although k can be considered an algorithm
parameter in some sense, it’s actually a hyperparameter. It’s selected
manually and remains fixed at both training and inference time.
K-nearest Neighbor(KNN)
The k-nearest neighbors algorithm is also non-linear. In contrast to simpler
models like linear regression, it will work well with data in which the
relationship between the independent variable (x) and the dependent
variable (y) is not a straight line.
What is k in k-nearest neighbors?
The parameter k in kNN refers to the number of labeled points (neighbors)
considered for classification. The value of k indicates the number of these
points used to determine the result. Our task is to calculate the distances and
identify which category the points closest to our unknown entity belong to.
K-nearest Neighbor(KNN)
How does it work?
Given a point whose class we do not know, we can try to understand which
points in our feature space are closest to it.
These points are the k-nearest neighbors. Since similar things occupy
similar places in feature space, it’s very likely that the point belongs to the
same class as its neighbors.
Based on that, it’s possible to classify a new point as belonging to one class
or another.
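A minimal sketch with scikit-learn; the choice of k = 3 and the iris data are illustrative assumptions, not from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)   # k is a hyperparameter, fixed before training
knn.fit(X_train, y_train)                   # "training" just stores the labeled points
print(knn.score(X_test, y_test))            # accuracy of the neighbor-vote classification
```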
