
Machine Learning

AIML CZG565
M4 : Linear Models for Classification

Course Faculty of MTech Cluster
BITS Pilani – CSIS – WILP, Pilani Campus
Machine Learning
Disclaimer and Acknowledgement

• The content of these modules and the topics covered were planned by the course owner, Dr. Sugata, with grateful acknowledgement to the many others who made their course materials freely available online.
• The content for these slides has been obtained from books and various other sources on the Internet.
• We hereby acknowledge all the contributors for their material and inputs.
• We have provided source information wherever necessary.
• To ease students' reading, we have added additional slides to this Canvas upload that are not shown in the live class, for more detailed explanation.
• Students are requested to refer to the textbook for detailed coverage of the content in this presentation deck shared over Canvas.

Slide Source / Preparation / Review:
From BITS Pilani WILP: Prof. Sugata, Prof. Chetana, Prof. Rajavadhana, Prof. Monali, Prof. Sangeetha, Prof. Swarna, Prof. Pankaj
External: CS109 and CS229 Stanford lecture notes, Dr. Andrew Ng, and many others who made their course materials freely available online.

BITS Pilani, Pilani Campus


Agenda

• Discriminant Functions

• Probabilistic Generative Classifiers

• Probabilistic Discriminative Classifiers

• Logistic Regression

• Applications : Text classification model

BITS Pilani, Pilani Campus


Decision Theory
&
Objective of Classification Models

BITS Pilani
Classification

• Given a collection of records (training set)
  – Each record is characterized by a tuple (x, y), where x is the attribute (feature) set and y is the class label
    • x: aka attribute, predictor, independent variable, input
    • y: aka class, response, dependent variable, output
• Task
  – Learn a model or function that maps each attribute set x into one of the predefined class labels y

Task | Attribute set, x | Class label, y
Categorizing email messages | Features extracted from email message header and content | spam or non-spam
Identifying tumor cells | Features extracted from X-rays or MRI scans | malignant or benign cells
Cataloging galaxies | Features extracted from telescope images | elliptical, spiral, or irregular-shaped galaxies

BITS Pilani, Pilani Campus


Logistic Regression Applications

• Credit Card Fraud: predicting whether a given credit card transaction is fraudulent or not
• Health: predicting whether a given mass of tissue is benign or malignant
• Marketing: predicting whether a given user will buy an insurance product or not
• Banking: predicting whether a customer will default on a loan

BITS Pilani, Pilani Campus


Inductive Learning Hypothesis : Interpretation

• Target Concept: t
  • Discrete: f(x) ∈ {Yes, No, Maybe} → Classification
  • Continuous: f(x) ∈ [20–100] → Regression
  • Probability Estimation: f(x) ∈ [0–1]
BITS Pilani, Pilani Campus
Decision Theory

• Target Concept: t
  • Discrete: f(x) ∈ {Yes, No}, i.e., t ∈ {0, 1} → Binary Classification
  • Continuous: f(x) ∈ [20–100]
  • Probability Estimation: f(x) ∈ [0–1]

ML Task: Predict the employability of interview candidates based on CGPA & IQ
Preprocessing applied: scaling (Min-Max style normalization) on IQ; the scaled column is the one used by the model below

CGPA | IQ (scaled) | IQ (raw) | Job Offered
5.5  | 6.7 | 100 | 1
5    | 7   | 105 | 0
8    | 6   | 90  | 1
9    | 7   | 105 | 1
6    | 8   | 120 | 0
7.5  | 7.3 | 110 | 0
BITS Pilani, Pilani Campus
How does logistic regression handle
missing values?

• Replace missing values with column averages (i.e., replace missing values in feature 1 with the average of feature 1).
• Replace missing values with column medians.
• Impute missing values using the other features.
• Remove records that are missing features.
• Use a machine learning technique that handles missing values natively, e.g., decision trees.

BITS Pilani, Pilani Campus


Decision Theory:
The decision problem: given x, predict t according to a probabilistic model p(x, t)

• Target Concept: t
  • Discrete: f(x) ∈ {Yes, No}, i.e., t ∈ {0, 1}
  • Continuous: f(x) ∈ [20–100]
  • Probability Estimation: f(x) ∈ [0–1]

Determining p(x, C_k) is the (central!) inference problem.

CGPA | IQ (scaled) | IQ (raw) | Job Offered | P(Job = 1)
5.5  | 6.7 | 100 | 1 | 0.8
5    | 7   | 105 | 0 | 0.4
8    | 6   | 90  | 1 | 0.75
9    | 7   | 105 | 1 | 0.95
6    | 8   | 120 | 0 | 0.35
7.5  | 7.3 | 110 | 0 | 0.4

For example, for the record X = <5, 7, 105> the model outputs P(C_k | X) = 0.4.
BITS Pilani, Pilani Campus
Classification Problem: Stages

Training Set:
CGPA | IQ | Job Offered
5.5 | 6.7 | 1
5   | 7   | 0
8   | 6   | 1
9   | 7   | 1
6   | 8   | 0
7.5 | 7.3 | 0

Induction / Inference step: the learning algorithm learns p(x, C_job=1) and p(x, C_job=0), i.e., a model for p(x, C_k).

Deduction / Decision step: apply the model to the test set to find the optimal t.

Test Set:
CGPA | IQ | Job Offered
3   | 4 | ?
7   | 6 | ?
5.5 | 8 | ?
BITS Pilani, Pilani Campus
Decision Region

Sample Rule / Hypothesis:
IF CGPA > 7 THEN Job = 1 ELSE Job = 0

Training Set:
CGPA | IQ | Job Offered
5.5 | 6.7 | 1
5   | 7   | 0
8   | 6   | 1
9   | 7   | 1
6   | 8   | 0
7.5 | 7.3 | 0

The model divides the input space into regions R_k called decision regions, one for each class, such that all points in R_k are assigned to class C_k. A mistake occurs when an input vector belonging to class C1 is assigned to class C2.

[Figure: class-conditional probability distributions over CGPA for Job = 1 and Job = 0, with the decision boundary at CGPA = 7]
BITS Pilani, Pilani Campus
Misclassification Rate

To minimize p(mistake), each x is assigned to whichever class has the smaller value of the integrand. For the two-class case,

p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1) = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx

The minimum probability of making a mistake is obtained if each value of x is assigned to the class for which the posterior probability p(C_k | x) is largest.
BITS Pilani, Pilani Campus
Decision Theory - Summary

[Figure: class-conditional probability distributions over CGPA, comparing decision thresholds at CGPA = 7 and CGPA = 8]
BITS Pilani, Pilani Campus
Linear Models for Classification

BITS Pilani
Types of Classification
Decision Theory: Interpretation – Model Building

Generative

[Figure: class-conditional densities for the two classes over the feature space (x1 = CGPA, x2 = IQ)]

P(Y | X1, X2, …, Xd) = P(X1, X2, …, Xd | Y) · P(Y) / P(X1, X2, …, Xd)

Training data:
CGPA | IQ | Job Offered
5.5 | 6.7 | 1
5 | 7 | 0
8 | 6 | 1
9 | 7 | 1
6 | 8 | 0
7.5 | 7.3 | 0
… | … | …

These are known as generative models because, by sampling from them, it is possible to generate synthetic data points in the input space.
Examples – Classification: Naïve Bayes; Clustering: Mixtures of Gaussians
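To make the generative route concrete, here is a minimal sketch that fits a Gaussian Naïve Bayes model (one possible generative classifier; the slides do not prescribe a specific one) to the toy CGPA/IQ data and reads off the class posteriors. The data values come from the table above; the use of scikit-learn's GaussianNB and the query point are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy training data from the slide: [CGPA, IQ], label = Job Offered
X = np.array([[5.5, 6.7], [5, 7], [8, 6], [9, 7], [6, 8], [7.5, 7.3]])
y = np.array([1, 0, 1, 1, 0, 0])

# Generative model: learns p(x | C_k) (a Gaussian per class) and the priors p(C_k)
model = GaussianNB().fit(X, y)

# Posterior p(C_k | x) for a new (hypothetical) candidate, obtained via Bayes' rule
x_new = np.array([[7.0, 6.5]])
print(model.predict_proba(x_new))  # [[P(Job=0 | x), P(Job=1 | x)]]
print(model.predict(x_new))
```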
BITS Pilani, Pilani Campus
Types of Classification
Decision Theory: Interpretation Model Building

Discriminative

The model learns the decision boundary θ0 + Σi θi·xi = 0 directly:
  Predict y = 1 if θ0 + Σi θi·xi ≥ 0
  Predict y = 0 if θ0 + Σi θi·xi < 0

[Figure: the two classes in the (x1 = CGPA, x2 = IQ) plane separated by a linear decision boundary]

Training data:
CGPA | IQ | Job Offered
5.5 | 6.7 | 1
5 | 7 | 0
8 | 6 | 1
9 | 7 | 1
6 | 8 | 0
7.5 | 7.3 | 0
… | … | …

Examples: logistic regression, SVMs, tree-based classifiers (e.g., decision trees), traditional neural networks, nearest neighbor

BITS Pilani, Pilani Campus


Types of Classification
Decision Theory: Interpretation – Model Building

[Figure: decision tree with root split on CGPA (> 6 vs ≤ 6), then splits on IQ (< 7 vs ≥ 7), with leaves labelled 0 or 1]

IF CGPA > 6 and IQ ≥ 7: Job offered = 1
Else if CGPA ≤ 6 and IQ < 7: Job offered = 1
Else if CGPA ≤ 6 and IQ ≥ 7: Job offered = 0
…

Training data:
CGPA | IQ | Job Offered
5.5 | 6.7 | 1
5 | 7 | 0
8 | 6 | 1
9 | 7 | 1
6 | 8 | 0
7.5 | 7.3 | 0
… | … | …

Examples: logistic regression, SVMs, tree-based classifiers (e.g., decision trees), traditional neural networks, nearest neighbor

BITS Pilani, Pilani Campus


Types of Classification
Generative vs Discriminative Models – Model Building

• Generative Model
  • The class-conditional probability distribution of the attribute/feature set and the prior probability of the classes are learnt during the training phase.
  • Given these learnt probabilities, during the inference phase, the probabilities of a test record belonging to the different classes are calculated and compared.
  • Can result in a linear or nonlinear decision surface.

• Discriminative Model
  • Given a training set, a function f is learnt that directly maps an attribute/feature vector x to the output class (y = 1 or 0/−1).
  • A linear function f results in a linear decision surface.
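As a discriminative counterpart to the earlier Naïve Bayes sketch, the following hedged example fits scikit-learn's LogisticRegression to the same toy data; it learns P(Y | X) directly rather than modelling P(X | Y). The dataset is the slides' toy table; everything else (query point, default settings) is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[5.5, 6.7], [5, 7], [8, 6], [9, 7], [6, 8], [7.5, 7.3]])  # [CGPA, IQ]
y = np.array([1, 0, 1, 1, 0, 0])  # Job Offered

# Discriminative model: learns the mapping x -> P(y = 1 | x) directly
clf = LogisticRegression().fit(X, y)

print(clf.intercept_, clf.coef_)        # theta_0 and (theta_CGPA, theta_IQ)
print(clf.predict_proba([[7.0, 6.5]]))  # P(y = 0 | x), P(y = 1 | x) for a new candidate
```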

BITS Pilani, Pilani Campus


Logistic Regression

Idea:
Given data X and associated binary (0/1) class labels Y, logistic regression tries to learn a discriminant function P(Y | X); ideally P(Y = 1 | X) ≈ 1 when Y = 1 and ≈ 0 when Y = 0.

BITS Pilani, Pilani Campus
Logistic Regression vs Least Squares Regression

[Figure: sample data points, Job Offered? (Yes = 1 / No = 0) plotted against CGPA or IQ]

• Independent attribute: CGPA or IQ
• A discriminant function f(x) directly maps the input to a class label; in a two-class problem, f(·) is binary valued.
• Can we solve the problem using linear regression? E.g., fit a straight line and define a threshold at 0.5.
• Threshold the classifier output h(x) at 0.5:
  If h_θ(x) ≥ 0.5, predict "y = 1"
  If h_θ(x) < 0.5, predict "y = 0"
• In this use case, h(x) = 0.7 implies that there is a 70% chance of the candidate being selected in the interview.

BITS Pilani, Pilani Campus


Decision Rules

• Classifier:
  f(x, w) = w0 + wᵀx   (linear discriminant function)
• Decision rule (mathematically):
  y = sign(w0 + wᵀx)
• This specifies a linear classifier: it has a linear boundary (hyperplane)
  w0 + wᵀx = 0
• A discriminant is a function that takes an input vector x and assigns it to one of K classes, denoted C_k.
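A minimal sketch of this decision rule, assuming a 2-D input and hand-picked weights chosen purely for illustration:

```python
import numpy as np

def linear_classifier(x, w0, w):
    """Return the class (+1 or -1) given by sign(w0 + w.T x)."""
    return int(np.sign(w0 + np.dot(w, x)))

# Hypothetical weights: the boundary is w0 + w1*CGPA + w2*IQ = 0
w0, w = -3.0, np.array([1.0, 0.2])
print(linear_classifier(np.array([8.0, 6.0]), w0, w))  # +1 (point above the hyperplane)
print(linear_classifier(np.array([2.0, 1.0]), w0, w))  # -1 (point below the hyperplane)
```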

BITS Pilani, Pilani Campus


Logistic Regression vs Least Squares Regression

[Figure: Job Offered? (Yes = 1 / No = 0) plotted against CGPA or IQ; the fitted least-squares line shifts after a new point is added]

Failure due to adding a new point.

BITS Pilani, Pilani Campus


Logistic Regression vs Least Squares Regression

[Figure: left, decision boundaries from logistic regression and least squares regression; right, the same data with extra points added at the bottom left]

The right-hand plot shows the corresponding results obtained when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression.

BITS Pilani, Pilani Campus


Logistic Regression vs Least Squares Regression

• Linear regression could help us predict a student's test score on a scale of 0–100. Linear regression predictions are continuous (numbers in a range).

• Logistic regression could help us predict whether the student passed or failed. Logistic regression predictions are discrete (only specific values or categories are allowed). We can also view the probability scores underlying the model's classifications.

Intuition behind the model:
Classification requires discrete values: y = 0 or 1.
For linear regression, the output h_θ(x) can be much greater than 1 or much less than 0.
Logistic regression keeps the output in the range 0 ≤ h_θ(x) ≤ 1.

BITS Pilani, Pilani Campus


Sigmoid Function

• The sigmoid/logistic function takes a real value as input and outputs another value between 0 and 1.
• The resulting framework is called logistic regression:
  – Logistic: the particular sigmoid function it uses
  – Regression: combines a weight vector with the observations to produce an answer
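A small sketch of the logistic (sigmoid) function itself, assuming NumPy; the test values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 (the decision-boundary value)
print(sigmoid(5.0))   # ~0.993, saturates towards 1 for large positive z
print(sigmoid(-5.0))  # ~0.007, saturates towards 0 for large negative z
```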

BITS Pilani, Pilani Campus


Logistic Regression vs Least Squares Regression

Classification: y = 0 or 1.

Linear regression fits Y = θ0 + Σi θi·xi, so h(x) can be greater than 1 or less than 0, which is another drawback of using linear regression for this problem.

What we need instead is a model of the log-odds:
  ln( p / (1 − p) ) = θ0 + Σi θi·xi

Logistic Regression: solving for p gives p = 1 / (1 + e^(−(θ0 + Σi θi·xi)))

BITS Pilani, Pilani Campus


Classification – Linear vs Non-Linear Decision Boundary

• At the decision boundary, the output of logistic regression is 0.5.


BITS Pilani, Pilani Campus


Logistic Regression – Sample Linear Boundary

• At the decision boundary, the output of logistic regression is 0.5.
• h_θ(x) = g(θ0 + θ1·x1 + θ2·x2)
  – e.g., θ0 = −3, θ1 = 1, θ2 = 1

[Figure: sample data points in the (CGPA, IQ) plane with the linear decision boundary]

• Predict "y = 1" if −3 + x1 + x2 ≥ 0

Slide credit: Andrew Ng
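A quick numeric check of this boundary, assuming the example weights above (θ0 = −3, θ1 = θ2 = 1); the test inputs are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])  # [theta_0, theta_1, theta_2] from the slide example

def predict(x1, x2):
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if sigmoid(z) >= 0.5 else 0  # equivalently: 1 if z >= 0

print(predict(1.0, 1.0))  # z = -1 -> h < 0.5 -> predict 0
print(predict(2.0, 2.0))  # z = +1 -> h > 0.5 -> predict 1
print(predict(1.5, 1.5))  # z =  0 -> exactly on the boundary -> predict 1
```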

BITS Pilani, Pilani Campus


Logistic Regression – Sample Non-Linear Boundary

[Figure: sample data points in the (CGPA, IQ) plane where the two classes are separated by a non-linear decision boundary]

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Learning model parameters

• Training set: m examples
• How do we choose the parameters (feature weights)?

BITS Pilani, Pilani Campus


Notion of Cost Function in Classification

BITS Pilani, Pilani Campus

Logistic Regression

• Training set: m examples
• How do we choose the parameters (feature weights)?

[Figure: J(θ) plotted against θ; the squared-error cost with a sigmoid hypothesis is "non-convex" (many local minima), whereas the logistic cost is "convex"]

BITS Pilani, Pilani Campus


Error (Cost) Function

• Our prediction function is non-linear (due to the sigmoid transform).
• Squaring this prediction, as we do in MSE, results in a non-convex function with many local minima.
• If our cost function has many local minima, gradient descent may not find the optimal global minimum.
• So instead of Mean Squared Error, we use an error/cost function called Cross-Entropy, also known as Log Loss.

Cross Entropy
• Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1.
• Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of 0.012 when the actual observation label is 1 would be bad and result in a high loss value.
• A perfect model would have a log loss of 0.
• Cross-entropy loss can be divided into two separate cost functions: one for y = 1 and one for y = 0.
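A hedged sketch of the cross-entropy (log loss) written directly from the definition above; the sample predictions are made up for illustration:

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Average log loss: -[y*log(p) + (1-y)*log(1-p)], clipped for numerical safety."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
print(cross_entropy(y, np.array([0.9, 0.1, 0.8, 0.7])))    # small loss: confident and correct
print(cross_entropy(y, np.array([0.012, 0.9, 0.2, 0.3])))  # large loss: confident and wrong
```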
BITS Pilani, Pilani Campus
Logistic regression cost function (cross entropy)

If y = 1, i.e., the actual value is Job Offered = 1:
  Cost(h_θ(x), y) = −log(h_θ(x))

[Figure: cost plotted against the predicted value h_θ(x) on [0, 1]; the cost is 0 when the prediction is 1 (Job Offered = 1) and grows without bound as the prediction approaches 0 (Job Offered = 0)]

BITS Pilani, Pilani Campus


Logistic regression cost function

If y = 0, i.e., the actual value is Job Offered = 0:
  Cost(h_θ(x), y) = −log(1 − h_θ(x))
  Cost = 0 if y = 0 and h_θ(x) = 0.

[Figure: cost plotted against the predicted value h_θ(x) on [0, 1]; the cost is 0 when the prediction is 0 (Job Offered = 0) and grows without bound as the prediction approaches 1 (Job Offered = 1)]

BITS Pilani, Pilani Campus


Cost function

To fit the parameters θ: apply the Gradient Descent algorithm to solve min_θ J(θ).

To make a prediction given a new x, output:
  h_θ(x) = 1 / (1 + e^(−θᵀx))

BITS Pilani, Pilani Campus


Gradient Descent Algorithm

J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) · log h_θ(x^(i)) + (1 − y^(i)) · log(1 − h_θ(x^(i))) ]

Goal: min_θ J(θ)

Repeat {
  θ_j := θ_j − α · ∂J(θ)/∂θ_j
}

where
  ∂J(θ)/∂θ_j = (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) · x_j^(i)
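A self-contained sketch of batch gradient descent for logistic regression, directly following the update rule above; the variable names, learning rate, and iteration count are illustrative choices, not part of the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.
    X: (m, n) feature matrix; a column of 1s is prepended for the intercept."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])   # add bias column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        h = sigmoid(Xb @ theta)            # h_theta(x) for all examples
        grad = (Xb.T @ (h - y)) / m        # (1/m) * sum((h - y) * x_j)
        theta -= alpha * grad              # simultaneous update of all theta_j
    return theta

# Toy CGPA / scaled-IQ data from the earlier slides
X = np.array([[5.5, 6.7], [5, 7], [8, 6], [9, 7], [6, 8], [7.5, 7.3]])
y = np.array([1, 0, 1, 1, 0, 0])
print(gradient_descent(X, y))
```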

BITS Pilani, Pilani Campus


Gradient Descent Algorithm

Linear Regression:
Repeat {
  θ_j := θ_j − α · (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) · x_j^(i)
}
with h_θ(x) = θᵀx

Logistic Regression:
Repeat {
  θ_j := θ_j − α · (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) · x_j^(i)
}
with h_θ(x) = 1 / (1 + e^(−θᵀx))

For the job-offer example: h_θ(x) = 1 / (1 + e^(−(θ0 + θ1·CGPA + θ2·IQ)))

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Logistic regression more generally

• Logistic regression when Y is not Boolean (but still discrete-valued).
• Now y ∈ {y1, …, yR}: learn R − 1 sets of weights.

For k < R (the 1st, 2nd, …, (R−1)th label):
  P(Y = y_k | X) = exp(θ_k0 + Σ_i θ_ki·X_i) / ( 1 + Σ_{j=1}^{R−1} exp(θ_j0 + Σ_i θ_ji·X_i) )

For k = R:
  P(Y = y_R | X) = 1 / ( 1 + Σ_{j=1}^{R−1} exp(θ_j0 + Σ_i θ_ji·X_i) )

For R = 2 this is equivalent to the sigmoid function: multiply the numerator and denominator by exp(−θᵀx) in the original form to get this.

E.g., if the class has three distinct values, with the assumption that the dataset has 6 features and a linear decision boundary works:
• The classifier learns two sets of weights; each set of weights has {θ0, θ1, θ2, …, θ6}.
• For the third class value (label k = R) it uses the second formula for estimation.

BITS Pilani, Pilani Campus


Application of Logistic Regression
&
Problem Types

BITS Pilani
BITS Pilani, Pilani Campus
Example: Sentiment Analysis – With Engineered features

Sentiment Features

BITS Pilani, Pilani Campus


Classifying sentiment using logistic regression

BITS Pilani, Pilani Campus


Logistic Regression – Fit a Model – Apply Gradient Descent

CGPA | IQ (scaled) | IQ (raw) | Job Offered
5.5  | 6.7 | 100 | 1
5    | 7   | 105 | 0
8    | 6   | 90  | 1
9    | 7   | 105 | 1
6    | 8   | 120 | 0
7.5  | 7.3 | 110 | 0

Hyperparameters:
• Learning Rate = 0.3
• Initial Weights = (0.5, 0.5, 0.5)
• Regularization Constant = 0

Initial model: θᵀX = 0.5 + 0.5·CGPA + 0.5·IQ, so
  h_θ(x) = 1 / (1 + e^(−θᵀx)) = 1 / (1 + e^(−(0.5 + 0.5·CGPA + 0.5·IQ)))

Update equations (batch gradient descent over the 6 records):
  θ0     := θ0     − 0.3 · (1/6) Σ_{i=1}^{6} ( h_θ(x^(i)) − y^(i) ) · 1
  θ_CGPA := θ_CGPA − 0.3 · (1/6) Σ_{i=1}^{6} ( h_θ(x^(i)) − y^(i) ) · x_CGPA^(i)
  θ_IQ   := θ_IQ   − 0.3 · (1/6) Σ_{i=1}^{6} ( h_θ(x^(i)) − y^(i) ) · x_IQ^(i)

Approximate new weights after the first iteration:
  θ0 = 0.4, θ_CGPA = −0.4, θ_IQ = −0.6
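The one-iteration update above can be checked with a short script. This is a sketch under the assumption that the scaled IQ column (6.7, 7, …) is the feature actually fed to the model, which is how the slide's numbers appear to be set up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Features: [1 (bias), CGPA, IQ (scaled)]; labels: Job Offered
X = np.array([[1, 5.5, 6.7], [1, 5, 7], [1, 8, 6],
              [1, 9, 7], [1, 6, 8], [1, 7.5, 7.3]])
y = np.array([1, 0, 1, 1, 0, 0])

theta = np.array([0.5, 0.5, 0.5])  # initial weights from the slide
alpha = 0.3                        # learning rate from the slide

# One batch gradient-descent step: theta_j -= alpha * (1/m) * sum((h - y) * x_j)
h = sigmoid(X @ theta)
theta = theta - alpha * (X.T @ (h - y)) / len(y)
print(theta)  # roughly [0.35, -0.42, -0.61], close to the slide's approximate (0.4, -0.4, -0.6)
```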

BITS Pilani, Pilani Campus


Logistic Regression – Inference & Interpretation

CGPA | IQ (scaled) | IQ (raw) | Job Offered
5.5  | 6.7 | 100 | 1
5    | 7   | 105 | 0
8    | 6   | 90  | 1
9    | 7   | 105 | 1
6    | 8   | 120 | 0
7.5  | 7.3 | 110 | 0

Assume the fitted model is θᵀx = 0.4 + 0.3·CGPA − 0.45·IQ.

Predict the job offer for a candidate with (CGPA, IQ) = (5, 6):
  h(x) = sigmoid(0.4 + 0.3·5 − 0.45·6) = sigmoid(−0.8) ≈ 0.31
  Y-Predicted = 0 / No

Note:
The exponential of a regression coefficient (e^θ_CGPA) is the odds ratio associated with a one-unit increase in CGPA: the odds of being offered a job increase by a factor of e^0.3 ≈ 1.35 for every unit increase in CGPA. [np.exp(model.params)]
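A hedged illustration of that odds-ratio reading using NumPy; the statsmodels-style `model.params` mentioned on the slide is replaced here by an explicit coefficient array:

```python
import numpy as np

coef = np.array([0.4, 0.3, -0.45])  # [intercept, theta_CGPA, theta_IQ] from the slide

# Prediction for a candidate with CGPA = 5, IQ = 6
z = coef @ np.array([1, 5, 6])
p = 1 / (1 + np.exp(-z))
print(round(p, 2))      # ~0.31 -> predict Job Offered = 0

# Odds ratio: exp(coefficient) is the multiplicative change in odds per unit increase
print(np.exp(coef[1]))  # ~1.35 for CGPA, as noted on the slide
```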

BITS Pilani, Pilani Campus


Regularization

Note: this topic was already covered in Module 3 and the implementation remains the same. Only one point is added here regarding its interpretation for logistic regression.

BITS Pilani, Pilani Campus
Ways to Control Overfitting – Interpretation of the Hyperparameter

• Regularization:

  Loss(S) = Σ_{i=1}^{n} Loss(ŷ_i, y_i) + α · Σ_{j=1}^{#Weights} |θ_j|

Note:
The hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not alpha (as in other linear models), but its inverse, C. The higher the value of C, the less the model is regularized.
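A minimal sketch of how C maps to regularization strength in scikit-learn; the C values are arbitrary and the toy data is reused from the earlier slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[5.5, 6.7], [5, 7], [8, 6], [9, 7], [6, 8], [7.5, 7.3]])
y = np.array([1, 0, 1, 1, 0, 0])

# C is the inverse of the regularization strength: small C = strong regularization
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C).fit(X, y)
    print(C, clf.coef_)  # the coefficients shrink towards 0 as C decreases
```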
BITS Pilani, Pilani Campus
Evaluation of Classifiers
Using Another Example

Following contents are common for all the classifiers

BITS Pilani
BITS Pilani, Pilani Campus
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:
Given m classes, an entry CM_{i,j} in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j. The matrix may have extra rows/columns to provide totals.

                | Predicted: C1        | Predicted: ¬C1
Actual: C1      | True Positives (TP)  | False Negatives (FN)
Actual: ¬C1     | False Positives (FP) | True Negatives (TN)

• True Positive (TP): the number of predictions where the classifier correctly predicts the positive class as positive.
• True Negative (TN): the number of predictions where the classifier correctly predicts the negative class as negative.
• False Positive (FP): the number of predictions where the classifier incorrectly predicts the negative class as positive.
• False Negative (FN): the number of predictions where the classifier incorrectly predicts the positive class as negative.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics: Confusion Matrix

Classifier Accuracy (recognition rate): the percentage of test set tuples that are correctly classified.
  Accuracy = (TP + TN) / All
Accuracy is most effective when the class distribution is relatively balanced.

Classification Error (misclassification rate): 1 − accuracy, or
  Error = (FP + FN) / All

                | Predicted: C1        | Predicted: ¬C1
Actual: C1      | True Positives (TP)  | False Negatives (FN)
Actual: ¬C1     | False Positives (FP) | True Negatives (TN)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evaluation of Classification Model: Confusion Matrix

Training data (Mileage in kmpl, Car Price in cr): (9.8, High), (9.12, Low), (9.5, High), (10, Low), …
Unseen data (Mileage in kmpl, Car Price in cr): (7.5, High), (10, Low), …

PREDICTED CLASS:
                    | Class = Low | Class = High
ACTUAL Class = Low  | a (TP)      | b (FN)
ACTUAL Class = High | c (FP)      | d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Accuracy is most effective when the class distribution is relatively balanced.

Model 1: CarPrice = 1 / (1 + e^(−8.5 + 0.5·Mileage − 1.5·Mileage²)), Accuracy: 99%
Model 2: CarPrice = 1 / (1 + e^(5.5 − 1.5·Mileage)), Accuracy: 50%
BITS Pilani, Pilani Campus
Evaluation of Classification Model: Confusion Matrix

Model 1 confusion matrix:
                    | Predicted: Low | Predicted: High
ACTUAL Class = Low  | 0 (TP)         | 10 (FN)
ACTUAL Class = High | 0 (FP)         | 990 (TN)

Model 2 confusion matrix:
                    | Predicted: Low | Predicted: High
ACTUAL Class = Low  | 10 (TP)        | 0 (FN)
ACTUAL Class = High | 500 (FP)       | 490 (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

If a model predicts everything to be class NO, its accuracy is 990/1000 = 99%. This is misleading because such a trivial model does not detect any class YES example. Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects, etc.).

Which model is better?
Model 1: CarPrice = 1 / (1 + e^(−8.5 + 0.5·Mileage − 1.5·Mileage²)), Accuracy: 99%
Model 2: CarPrice = 1 / (1 + e^(5.5 − 1.5·Mileage)), Accuracy: 50%
BITS Pilani, Pilani Campus
Evaluation of Classification Model: Confusion Matrix

                    | Predicted: Low | Predicted: High
ACTUAL Class = Low  | a (TP)         | b (FN)
ACTUAL Class = High | c (FP)         | d (TN)

The F-score (also known as the F1 score or F-measure) combines precision and recall into a single score. The F1-score is a better metric when there are imbalanced classes (more on this in upcoming slides).

F-score = 2 · (precision · recall) / (precision + recall)
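A short sketch computing precision, recall, F1, and accuracy directly from confusion-matrix counts; the counts used below are Model 2's from the previous slide, treating "Low" as the positive class:

```python
def prf_from_confusion(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called TPR / sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, f1, accuracy

# Model 2 from the previous slide: TP=10, FN=0, FP=500, TN=490
print(prf_from_confusion(10, 0, 500, 490))  # high recall but very low precision and F1
```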
BITS Pilani, Pilani Campus
Evaluation of Classifiers

Given below is a confusion matrix for medical data where the class values are yes and no for a class label attribute, cancer. Calculate the accuracy of the classifier.

Actual Class \ Predicted | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes             | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no              | 140          | 9560        | 9700  | 98.56 (specificity)
Total                    | 230          | 9770        | 10000 | 96.50 (accuracy)

BITS Pilani, Pilani Campus


Which Classifier is better? Low Skew Case

T1:
                   | Predicted: Yes | Predicted: No
ACTUAL Class = Yes | 50             | 50
ACTUAL Class = No  | 1              | 99
Precision (p) ≈ 0.98, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure ≈ 0.66

T2:
                   | Predicted: Yes | Predicted: No
ACTUAL Class = Yes | 99             | 1
ACTUAL Class = No  | 10             | 90
Precision (p) ≈ 0.9, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure ≈ 0.94

T3:
                   | Predicted: Yes | Predicted: No
ACTUAL Class = Yes | 99             | 1
ACTUAL Class = No  | 1              | 99
Precision (p) ≈ 0.99, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure ≈ 0.99

BITS Pilani, Pilani Campus


Which Classifier is better? Medium Skew Case

T1:
                   | Predicted: Yes | Predicted: No
ACTUAL Class = Yes | 50             | 50
ACTUAL Class = No  | 10             | 990
Precision (p) ≈ 0.83, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure ≈ 0.62

T2:
                   | Predicted: Yes | Predicted: No
ACTUAL Class = Yes | 99             | 1
ACTUAL Class = No  | 100            | 900
Precision (p) ≈ 0.5, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure ≈ 0.66

T3:
                   | Predicted: Yes | Predicted: No
ACTUAL Class = Yes | 99             | 1
ACTUAL Class = No  | 10             | 990
Precision (p) ≈ 0.9, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure ≈ 0.94

BITS Pilani, Pilani Campus


Which Classifier is better? High Skew Case

T1:
                   | Predicted: Yes | Predicted: No
ACTUAL Class = Yes | 50             | 50
ACTUAL Class = No  | 100            | 9900
Precision (p) ≈ 0.3, TPR = Recall (r) = 0.5, FPR = 0.01, TPR/FPR = 50, F-measure ≈ 0.375

T2:
                   | Predicted: Yes | Predicted: No
ACTUAL Class = Yes | 99             | 1
ACTUAL Class = No  | 1000           | 9000
Precision (p) ≈ 0.09, TPR = Recall (r) = 0.99, FPR = 0.1, TPR/FPR = 9.9, F-measure ≈ 0.165

T3:
                   | Predicted: Yes | Predicted: No
ACTUAL Class = Yes | 99             | 1
ACTUAL Class = No  | 100            | 9900
Precision (p) ≈ 0.5, TPR = Recall (r) = 0.99, FPR = 0.01, TPR/FPR = 99, F-measure ≈ 0.66

BITS Pilani, Pilani Campus


Which Model should you use?

        | False Positive Rate | False Negative Rate
Model 1 | 41%                 | 3%
Model 2 | 5%                  | 25%

Mistakes have different costs:
• Disease screening: needs a LOW FN rate
• Spam filtering: needs a LOW FP rate

Conservative vs aggressive settings:
• The same application might need multiple tradeoffs

BITS Pilani, Pilani Campus


ROC (Receiver Operating Characteristic)

• A graphical approach for displaying the trade-off between detection rate and false alarm rate.
• AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes.
• Developed in the 1950s for signal detection theory, to analyze noisy signals.
• The ROC curve plots TPR against FPR; the performance of a model is represented as a point on the ROC curve.
• Usage:
  – Threshold selection
  – Performance assessment
  – Classifier comparison

Key points on the ROC plane (TPR, FPR):
• (0,0): declare everything to be the negative class
• (1,1): declare everything to be the positive class
• (1,0): ideal
• Diagonal line: random guessing
• Below the diagonal line: the prediction is the opposite of the true class
BITS Pilani, Pilani Campus
ROC (Receiver Operating Characteristic)

• To draw an ROC curve, the classifier must produce a continuous-valued output.
  – The outputs are used to rank the test records, from the record most likely to be in the positive class to the least likely.
  – By using different thresholds on this value, we can create different variations of the classifier with different TPR/FPR tradeoffs.
• Many classifiers produce only discrete outputs (i.e., the predicted class).
  – How do we get continuous-valued outputs?
    • Decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, SVM

BITS Pilani, Pilani Campus


How to Construct an ROC curve

Instance | Score | True Class
1  | 0.95 | +
2  | 0.93 | +
3  | 0.87 | −
4  | 0.85 | −
5  | 0.85 | −
6  | 0.85 | +
7  | 0.76 | −
8  | 0.53 | +
9  | 0.43 | −
10 | 0.25 | +

1. Use a classifier that produces a continuous-valued score for each instance; the more likely the instance is to be in the + class, the higher the score.
2. Sort the instances in decreasing order of score.
3. Apply a threshold at each unique value of the score.
4. Count the number of TP, FP, TN, FN at each threshold:
   • TPR = TP / (TP + FN)
   • FPR = FP / (FP + TN)

BITS Pilani, Pilani Campus


How to construct an ROC curve

Class        | +    | −    | +    | −    | −    | −    | +    | −    | +    | +    |
Threshold ≥  | 0.25 | 0.43 | 0.53 | 0.76 | 0.85 | 0.85 | 0.85 | 0.87 | 0.93 | 0.95 | 1.00
TP           | 5    | 4    | 4    | 3    | 3    | 3    | 3    | 2    | 2    | 1    | 0
FP           | 5    | 5    | 4    | 4    | 3    | 2    | 1    | 1    | 0    | 0    | 0
TN           | 0    | 0    | 1    | 1    | 2    | 3    | 4    | 4    | 5    | 5    | 5
FN           | 0    | 1    | 1    | 2    | 2    | 2    | 2    | 3    | 3    | 4    | 5
TPR          | 1    | 0.8  | 0.8  | 0.6  | 0.6  | 0.6  | 0.6  | 0.4  | 0.4  | 0.2  | 0
FPR          | 1    | 1    | 0.8  | 0.8  | 0.6  | 0.4  | 0.2  | 0.2  | 0    | 0    | 0

[ROC curve plotted from the (FPR, TPR) pairs above]
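The TPR/FPR values in this table can be reproduced with a few lines of NumPy; the sketch below sweeps a threshold over the scores from the previous slide (purely illustrative, not part of the original deck):

```python
import numpy as np

scores = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 1])  # + -> 1, - -> 0

# Sweep a threshold over each unique score (plus 1.00); tied scores such as the
# three 0.85 entries collapse into a single (FPR, TPR) point here.
for thr in sorted(set(scores) | {1.0}):
    pred = (scores >= thr).astype(int)   # predict + when the score clears the threshold
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    tn = np.sum((pred == 0) & (labels == 0))
    print(f"thr={thr:.2f}  TPR={tp / (tp + fn):.1f}  FPR={fp / (fp + tn):.1f}")
```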

BITS Pilani, Pilani Campus


Using ROC for Model Comparison

[Figure: ROC curves for models M1 and M2]

• No model consistently outperforms the other:
  – M1 is better for small FPR
  – M2 is better for large FPR

• Area Under the ROC curve (AUC):
  – Ideal: area = 1
  – Random guess: area = 0.5
  – The higher the AUC, the better the model is at prediction
BITS Pilani, Pilani Campus


Another Example

The table below shows the probability value (column 3) returned by a probabilistic classifier for each of the 10 tuples in a test set, sorted in decreasing probability order. The corresponding ROC curve is given on the right-hand side.

[Figure: tuple/probability table and the resulting ROC curve, shown on the original slide]

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Common Issues in Classifiers

BITS Pilani, Pilani Campus
Class Imbalance Problem
Problems where the classes are skewed (more records from one class than another)

• Finding a needle in a haystack
• Lots of classification problems have skewed classes:
  – Credit card fraud
  – Intrusion detection
  – Defective products in a manufacturing assembly line

Simple techniques to handle imbalance:
• Up-sample the minority class
  – randomly duplicate observations from the minority class
• Down-sample the majority class
  – remove random observations
• Generate synthetic samples
  – create new samples based on the distances between a point and its nearest neighbors

BITS Pilani, Pilani Campus


Class Imbalance Problem

• The main class of interest is rare.
• The data set distribution reflects a significant majority of the negative class and a minority positive class (e.g., Job Offered = Yes/1 as the majority, Job Offered = No/0 as the minority).
• Other examples:
  – In fraud detection applications, the class of interest (positive class) is "fraud".
  – In medical tests, there may be a rare class, such as "cancer".
• Accuracy might not be a good option for measuring performance in the case of a class imbalance problem.

BITS Pilani, Pilani Campus


Popular Solutions to Class Imbalance

• Generate synthetic samples
  • New samples are created based on the distances between a point and its nearest neighbors, e.g., the Synthetic Minority Oversampling Technique (SMOTE), available in the imbalanced-learn (imblearn) package that complements scikit-learn. A usage sketch follows this list.
• Change the performance metric: use recall, precision, or ROC curves instead of accuracy.
• Try different algorithms: some algorithms, such as Support Vector Machines and tree-based algorithms, may work better with imbalanced classes. We will discuss these post mid-term.

Many measures exist, but none of them may be ideal in all situations. Significant factors that help:
• Level of class imbalance
• Importance of TP vs FP
• Cost/time tradeoffs
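A hedged sketch of SMOTE oversampling, assuming the imbalanced-learn package is installed (pip install imbalanced-learn); the imbalanced dataset here is synthetic and purely for illustration:

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced data: 95 negative examples, 5 positive examples
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

# SMOTE synthesizes new minority-class points between existing points and their neighbors
X_res, y_res = SMOTE(k_neighbors=3, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # {0: 95, 1: 5} -> {0: 95, 1: 95}
```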

BITS Pilani, Pilani Campus


Dealing with Imbalanced Classes - Summary

• Many measures exist, but none of them may be ideal in all situations.
  – Random classifiers can have high values for many of these measures.
  – TPR/FPR provides important information but may not be sufficient by itself in many practical scenarios.
  – Given two classifiers, sometimes you can tell that one of them is strictly better than the other:
    • C1 is strictly better than C2 if C1 has strictly better TPR and FPR relative to C2 (or the same TPR and better FPR, and vice versa).
  – Even if C1 is strictly better than C2, C1's F-value can be worse than C2's if they are evaluated on data sets with different imbalances.
  – Classifier C1 can be better or worse than C2 depending on the scenario at hand.
BITS Pilani, Pilani Campus
Types of Classification Based
on the Output Labels

BITS Pilani
Types of Classification: Output Labels

• Target Concept

Examples of Multiclass:
• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical diagnostics: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow
BITS Pilani, Pilani Campus
Prediction – Multiclass Classification
One-vs-All Strategy (one-vs-rest)

[Figure: a 3-class problem in the (x1, x2) plane; each binary classifier h_θ^(i)(x) separates class i from the rest]

Class 1, Class 2, Class 3:
  h_θ^(i)(x) = P(y = i | x; θ)   for i = 1, 2, 3

For an input x, predict the class i that maximizes h_θ^(i)(x).

Note: Scikit-Learn detects when you try to use a binary classification algorithm for a multi-class classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO).
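A brief sketch of one-vs-rest logistic regression in scikit-learn on a hypothetical 3-class dataset; the explicit OneVsRestClassifier wrapper is used here only to make the strategy visible:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical 3-class data with 2 features
X, y = make_classification(n_samples=150, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

# One binary logistic-regression classifier is trained per class (class i vs the rest)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_))   # 3 binary classifiers
print(ovr.predict(X[:5]))     # each prediction is argmax_i h_theta^(i)(x)
```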

BITS Pilani, Pilani Campus


Prediction – Multiclass Classification
One-vs-One Strategy

[Figure: a 3-class problem in the (x1, x2) plane; a binary classifier is trained for each pair of classes]

Class 1, Class 2, Class 3:
  h_θ^(i)(x) = P(y = i | x; θ)   for i = 1, 2, 3

For an input x, predict the class i that maximizes h_θ^(i)(x).

One-vs-one trains N × (N − 1) / 2 classifiers, one for each pair of classes.
BITS Pilani, Pilani Campus
Logistic Regression (Classification) – Summary

• Model
  h_θ(x) = P(Y = 1 | X1, X2, …, Xn) = 1 / (1 + e^(−θᵀx))

• Cost function
  J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))
  Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1; −log(1 − h_θ(x)) if y = 0

• Learning
  Gradient descent: Repeat { θ_j := θ_j − α · (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) · x_j^(i) }

• Inference
  Ŷ = h_θ(x_test) = 1 / (1 + e^(−θᵀ·x_test))

Note:
• σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a logistic model predicts 1 if xᵀθ is positive, and 0 if it is negative.
• logit(p) = log(p / (1 − p)) is the inverse of the logistic function. Indeed, if you compute the logit of the estimated probability p, you will find that the result is t. The logit is also called the log-odds.
BITS Pilani, Pilani Campus


Logistic Regression – Additional Practice Exercises

CGPA | IQ (scaled) | IQ (raw) | Job Offered
5.5  | 6.7 | 100 | 1
5    | 7   | 105 | 0
8    | 6   | 90  | 1
9    | 7   | 105 | 1
6    | 8   | 120 | 0
7.5  | 7.3 | 110 | 0

Hyperparameters:
• Learning Rate = 0.8
• Initial Weights = (−0.1, 0.2, −0.5)
• Regularization Constant = 10

h_θ(x) = 1 / (1 + e^(−θᵀx))

For this problem (similar to the one discussed in class), note that the hyperparameters are different.

1. Formulate the gradient descent update equations for this problem.
2. Repeat GD for two iterations.
3. Find the loss at every iteration and interpret your observation.
4. Using the results of the second iteration, answer the questions below:
   a) Interpret the influence of CGPA on the response variable.
   b) Predict whether a new candidate with IQ = 5 and CGPA = 9 will be offered a job or not.
5. Repeat steps 2 to 4 using stochastic gradient descent instead of batch gradient descent for 4 iterations. (Take any random sample from among the 6 instances for these 4 iterations.)

BITS Pilani, Pilani Campus


Evaluation of Classifiers – Additional Practice Exercises

Given below is a confusion matrix for medical data where the class values are yes and no for a class label attribute, cancer. Answer the following questions.

1. Calculate the precision, recall, F-score, and error rate.
2. Brainstorm on use cases/scenarios, w.r.t. the given example, where precision is preferred over recall.
3. Brainstorm on use cases/scenarios, w.r.t. the given example, where recall is preferred over precision.

BITS Pilani, Pilani Campus


Formulation of the Gradient Descent equation for
Logistic regression from its cross entropy loss function
(Additional Reference for student’s Self Reading)

77

BITS Pilani, Pilani Campus


Logistic regression GD derivation

78

BITS Pilani, Pilani Campus


Logistic regression cost function

• Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1; −log(1 − h_θ(x)) if y = 0

• Combined form: Cost(h_θ(x), y) = −y·log(h_θ(x)) − (1 − y)·log(1 − h_θ(x))

• If y = 1: Cost(h_θ(x), y) = −log(h_θ(x))
• If y = 0: Cost(h_θ(x), y) = −log(1 − h_θ(x))

Slide credit: Andrew Ng

BITS Pilani, Pilani Campus


Step –I

Applying Chain rule and writing in terms of partial derivatives

80

BITS Pilani, Pilani Campus


Step –II

• Evaluating the partial derivative using the pattern of the derivative of the
sigmoid function.

81

BITS Pilani, Pilani Campus


Step III

• Simplifying the terms by multiplication.
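The worked equations for Steps I–III appear only as images in the original deck; the following LaTeX block is a reconstruction of the standard derivation (per-example loss shown; summing over the m examples and dividing by m gives the batch gradient used earlier):

```latex
\begin{align}
J(\theta) &= -\big[\, y \log h_\theta(x) + (1-y)\log\big(1-h_\theta(x)\big) \,\big],
\qquad h_\theta(x) = \sigma(\theta^{\top}x) = \frac{1}{1+e^{-\theta^{\top}x}} \\[4pt]
% Step I: chain rule through h = sigma(z), with z = theta^T x
\frac{\partial J}{\partial \theta_j}
  &= \frac{\partial J}{\partial h_\theta(x)} \cdot
     \frac{\partial h_\theta(x)}{\partial z} \cdot
     \frac{\partial z}{\partial \theta_j},
\qquad z = \theta^{\top}x \\[4pt]
% Step II: each factor, using sigma'(z) = sigma(z)(1 - sigma(z))
\frac{\partial J}{\partial h_\theta(x)} &= -\frac{y}{h_\theta(x)} + \frac{1-y}{1-h_\theta(x)},
\quad
\frac{\partial h_\theta(x)}{\partial z} = h_\theta(x)\big(1-h_\theta(x)\big),
\quad
\frac{\partial z}{\partial \theta_j} = x_j \\[4pt]
% Step III: multiply and simplify
\frac{\partial J}{\partial \theta_j}
  &= \Big(-\frac{y}{h_\theta(x)} + \frac{1-y}{1-h_\theta(x)}\Big)\,
     h_\theta(x)\big(1-h_\theta(x)\big)\, x_j
   = \big(h_\theta(x) - y\big)\, x_j
\end{align}
```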

BITS Pilani, Pilani Campus


Additional References

• Tom M. Mitchell
  Generative and discriminative classifiers: Naïve Bayes and Logistic Regression
  http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
• Andrew Ng, Michael Jordan
  On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes
  http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
• http://www.cs.cmu.edu/~tom/NewChapters.html
• http://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
• https://medium.com/@sangha_deb/naive-bayes-vs-logistic-regression-a319b07a5d4c
• https://www.youtube.com/watch?v=-la3q9d7AKQ
• http://www.datasciencesmachinelearning.com/2018/11/handling-outliers-in-python.html

Interpretability
• https://christophm.github.io/interpretable-ml-book/logistic.html

BITS Pilani, Pilani Campus


Thank you!

Required Reading for the completed session:
T1 – Chapter 6 (Tom M. Mitchell, Machine Learning)
R1 – Chapters 3 and 4 (Christopher M. Bishop, Pattern Recognition & Machine Learning), and refresh your MFDS course basics

Next Session Plan:
Module 5 – Decision Tree Classifier

BITS Pilani, Pilani Campus
