AIML ML Session 5 & Session 6 - Student Common Reference (With More Additional Reading)
AIML CZG565
M4 : Linear Models for Classification
• The content of these modules and the topics covered under them were planned by the course owner, Dr. Sugata, with grateful acknowledgement to the many others who made their course materials freely available online.
• The content for these slides has been obtained from textbooks and various other sources on the Internet.
• We hereby acknowledge all the contributors for their material and inputs.
• We have provided source information wherever necessary.
• To ease students' reading, we have added additional slides to this Canvas upload, not shown in the live class, for detailed explanation.
• Students are requested to refer to the textbook for the detailed content of this presentation deck shared over Canvas.
Slide Source / Preparation / Review:
From BITS Pilani WILP: Prof.Sugata, Prof.Chetana, Prof.Rajavadhana, Prof.Monali, Prof.Sangeetha, Prof.Swarna,
Prof.Pankaj
External: CS109 and CS229 Stanford lecture notes, Dr.Andrew NG and many others who made their course
materials freely available online.
• Discriminant Functions
• Logistic Regression
Classification
• Target Concept : t
Decision Theory
• Target Concept : t

Training Set:
  CGPA   IQ    Job-Offered
  5.5    6.7   1
  5      7     0
  8      6     1
  9      7     1
  6      8     0
  7.5    7.3   0

Induction / Inference step: the learning algorithm learns p(x, C_job=1) and p(x, C_job=0), i.e. a model for p(x, C_k).

Deduction / Decision step: apply the model to find the optimal target t for new inputs.

Test Set:
  CGPA   IQ    Job-Offered
  3      4     ?
  7      6     ?
  5.5    8     ?
Decision Region

Sample Rule / Hypothesis:
  IF CGPA > 7 THEN Job = 1, ELSE Job = 0

Training Set:
  CGPA   IQ    Job-Offered
  5.5    6.7   1
  5      7     0
  8      6     1
  9      7     1
  6      8     0
  7.5    7.3   0

Induction / Inference step: the learning algorithm learns a model for p(x, C_k).

The model divides the input space into regions R_k called decision regions, one for each class, such that all points in R_k are assigned to class C_k.

A mistake occurs when an input vector belonging to class C_1 is assigned to class C_2.

[Figure: probability distributions over CGPA for Job = 1 and Job = 0, with the decision threshold at CGPA = 7]
Misclassification Rate

To minimize p(mistake), each x is assigned to whichever class has the smaller value of the integrand.
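For the two-class case, the quantity being minimized can be written out explicitly (this is the standard decision-theory formulation, added here for completeness; R_1 and R_2 are the decision regions from the previous slide):

```latex
p(\text{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1)
                  = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx
```

Each x contributes to exactly one of the two integrals, so assigning x to the class with the larger joint density p(x, C_k) makes the corresponding integrand the smaller one and minimizes p(mistake).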
Decision Theory - Summary
[Figure: probability distributions over CGPA for the two classes, comparing decision thresholds at CGPA = 7 and CGPA = 8]
Linear Models for Classification
Types of Classification
Decision Theory: Interpretation and Model Building

Generative
[Figure: class-conditional densities over the CGPA (x1) – IQ (x2) plane]
Model the joint distribution and apply Bayes' theorem:
  P(Y | X_1, X_2, ..., X_d) = P(X_1, X_2, ..., X_d | Y) · P(Y) / P(X_1, X_2, ..., X_d)

Discriminative
[Figure: a linear decision boundary in the CGPA (x1) – IQ (x2) plane separating y = 1 from y = 0]
Model the decision boundary directly:
  θ_0 + Σ_i θ_i x_i ≥ 0  →  y = 1
  θ_0 + Σ_i θ_i x_i < 0  →  y = 0
Examples: logistic regression, SVMs, tree-based classifiers (e.g. decision trees), traditional neural networks, nearest neighbour.

Training data used in both cases:
  CGPA   IQ    Job-Offered
  5.5    6.7   1
  5      7     0
  8      6     1
  9      7     1
  6      8     0
  7.5    7.3   0
  ...    ...   ...
• Discriminative Model
Idea: Given data X and an associated binary (0/1) class label Y, logistic regression tries to learn a discriminant function P(Y | X), which should be close to 1 for examples with Y = 1 and close to 0 for examples with Y = 0.
Logistic Regression vs Least Squares Regression

[Figure: sample data points; Job Offered? (No = 0, Yes = 1) plotted against CGPA or IQ]

• Classifier:
  f(x, w) = w_0 + w^T x   (linear discriminant function)
• Decision rule:
  y = sign(w_0 + w^T x)
• Mathematically, the decision boundary is
  w_0 + w^T x = 0

A discriminant is a function that takes an input vector x and assigns it to one of K classes, denoted C_k.

[Figure: decision boundaries from least squares regression and logistic regression on the same data, with and without extra points at the bottom left]

The right-hand plot shows the corresponding results obtained when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression.
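This sensitivity is easy to reproduce. Below is a minimal sketch (not from the slides) that fits both models with scikit-learn; the CGPA values, labels, and the helper functions ls_boundary / lr_boundary are illustrative assumptions:

```python
# Contrast least squares regression (thresholded at 0.5) with logistic regression
# on a binary target. All data values here are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[5.0], [5.5], [6.0], [7.5], [8.0], [9.0]])  # e.g. CGPA
y = np.array([0, 0, 0, 1, 1, 1])                          # job offered?

def ls_boundary(X, y):
    # x at which the least squares line crosses 0.5 (its implied decision boundary)
    m = LinearRegression().fit(X, y)
    return float((0.5 - m.intercept_) / m.coef_[0])

def lr_boundary(X, y):
    # x at which theta0 + theta1*x = 0 for the logistic model
    m = LogisticRegression().fit(X, y)
    return float(-m.intercept_[0] / m.coef_[0][0])

print("clean data    : least squares %.2f | logistic %.2f" % (ls_boundary(X, y), lr_boundary(X, y)))

# Add two 'easy' positive points far to the right; they act as outliers for the line fit.
X2 = np.vstack([X, [[20.0]], [[25.0]]])
y2 = np.append(y, [1, 1])
print("with outliers : least squares %.2f | logistic %.2f" % (ls_boundary(X2, y2), lr_boundary(X2, y2)))
```

The least squares boundary should shift noticeably toward the added points, while the logistic boundary barely moves.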
• Logistic regression could help us predict whether the student passed or failed. Logistic regression predictions are discrete (only specific values or categories are allowed). We can also view the probability scores underlying the model's classifications.

Intuition behind the model:
Classification requires discrete values: y = 0 or 1.
For linear regression, the output h_θ(x) can be much greater than 1 or much less than 0.
Logistic Regression:
• The sigmoid/logistic function takes a real value as input and outputs another value between 0 and 1.
• This framework is called logistic regression:
  – Logistic: the special mathematical sigmoid function it uses.
  – Regression: it combines a weight vector with the observations to create an answer.
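Written out explicitly (consistent with the model summary later in this deck), the sigmoid and the resulting hypothesis are:

```latex
g(z) = \frac{1}{1 + e^{-z}}, \qquad
h_\theta(x) = g(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}} \in (0, 1)
```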
Classification: y = 0 or 1, but linear regression fits Y = θ_0 + Σ_i θ_i x_i.

Another drawback of using linear regression for this problem: h(x) can be > 1 or < 0.

What we need is a model of the log-odds:
  ln( p / (1 − p) ) = θ_0 + Σ_i θ_i x_i
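Solving the log-odds equation for p recovers the sigmoid form of the hypothesis (a one-line derivation added for completeness):

```latex
\ln\frac{p}{1-p} = \theta_0 + \sum_i \theta_i x_i
\;\Longrightarrow\;
p = \frac{1}{1 + e^{-\left(\theta_0 + \sum_i \theta_i x_i\right)}} = h_\theta(x)
```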
Logistic Regression:
• h_θ(x) = g(θ_0 + θ_1 x_1 + θ_2 x_2)
  – e.g., θ_0 = −3, θ_1 = 1, θ_2 = 1

[Figure: sample data points in the CGPA–IQ plane with the resulting decision boundary]

• Predict "y = 1" if −3 + x_1 + x_2 ≥ 0
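For concreteness, here is a short sketch (not from the slides) that evaluates this example hypothesis and applies the decision rule; the two sample inputs are made-up values:

```python
# Evaluate h(x) = g(-3 + x1 + x2) and predict y = 1 when the argument is >= 0.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])   # theta0, theta1, theta2 from the slide

def predict(x1, x2):
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return sigmoid(z), int(z >= 0)    # (P(y=1 | x), predicted label)

print(predict(1.0, 1.0))   # z = -1 -> p ~ 0.27, predict 0
print(predict(2.0, 2.0))   # z = +1 -> p ~ 0.73, predict 1
```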
[Figure: sample data points in the CGPA–IQ plane]

• Training set: m examples
Logistic Regression
• Training set: how do we choose the parameters (feature weights) θ?

[Figure: two cost surfaces as functions of θ, one labelled "non-convex" and one labelled "convex"]
Cross Entropy
• Cross-entropy loss, or log loss, measures the performance of a classification
model whose output is a probability value between 0 and 1.
• Cross-entropy loss increases as the predicted probability diverges from the
actual label. So predicting a probability of .012 when the actual observation label
is 1 would be bad and result in a high loss value.
• A perfect model would have a log loss of 0.
• Cross-entropy loss can be divided into two separate cost functions: one for y=1
and one for y=0.
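As a quick illustration of these statements, a minimal sketch (not from the slides; the probability values are invented):

```python
# Cross-entropy (log) loss for a single prediction.
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """y is the true label (0 or 1); p is the predicted P(y = 1 | x)."""
    p = np.clip(p, eps, 1 - eps)          # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(cross_entropy(1, 0.9))     # confident and correct  -> ~0.105
print(cross_entropy(1, 0.012))   # confident and wrong    -> ~4.42 (the slide's example)
print(cross_entropy(1, 1.0))     # perfect prediction     -> ~0
```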
Logistic regression cost function (cross entropy)

If y = 1 (i.e., actual value: Job Offered = 1):
  Cost(h_θ(x), y) = −log h_θ(x)
  [Figure: cost vs predicted value h_θ(x); the cost is 0 when the model predicts Job Offered = 1 and grows as the prediction moves toward Job Offered = 0]

If y = 0 (i.e., actual value: Job Offered = 0):
  Cost(h_θ(x), y) = −log(1 − h_θ(x))
  Cost = 0 if y = 0 and h_θ(x) = 0
  [Figure: cost vs predicted value h_θ(x); the cost is 0 when the model predicts Job Offered = 0 and grows as the prediction moves toward Job Offered = 1]

Over the whole training set:
  J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
Linear Regression
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) x_j^(i)
}
with h_θ(x) = θ^T x

Logistic Regression
Repeat {
  θ_j := θ_j − α (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) x_j^(i)
}
with h_θ(x) = 1 / (1 + e^(−θ^T x))
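The update rule above translates almost line for line into code. Below is a compact sketch (not from the slides) of batch gradient descent for logistic regression; the learning rate, iteration count, and helper name fit_logistic are illustrative assumptions, and the toy data is the CGPA/IQ training table used earlier:

```python
# Batch gradient descent for logistic regression, following the update rule above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.01, iters=20000):
    """X: (m, n) feature matrix without a bias column; y: (m,) labels in {0, 1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1 for theta0
    theta = np.zeros(Xb.shape[1])
    m = len(y)
    for _ in range(iters):
        h = sigmoid(Xb @ theta)                      # h_theta(x) for all examples
        theta -= alpha * (Xb.T @ (h - y)) / m        # simultaneous update of every theta_j
    return theta

X = np.array([[5.5, 6.7], [5, 7], [8, 6], [9, 7], [6, 8], [7.5, 7.3]], dtype=float)
y = np.array([1, 0, 1, 1, 0, 0], dtype=float)
theta = fit_logistic(X, y)
print("theta:", theta)
print("P(job = 1):", sigmoid(np.hstack([np.ones((6, 1)), X]) @ theta).round(2))
```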
For k = R:
E.g., if the class label has three distinct values, the dataset has 6 features, and a linear decision boundary works, the classifier learns two sets of weights, each of the form {θ_0, θ_1, θ_2, ..., θ_6}. For the third class value (label k = R), it uses the second formula for estimation.
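The two formulas being referred to are presumably the standard multiclass logistic regression equations (as in the Tom Mitchell notes listed at the end of this deck), reconstructed here for readability:

```latex
P(Y = k \mid X) =
  \frac{\exp\!\big(\theta_{k0} + \sum_{i=1}^{n} \theta_{ki} X_i\big)}
       {1 + \sum_{j=1}^{R-1} \exp\!\big(\theta_{j0} + \sum_{i=1}^{n} \theta_{ji} X_i\big)}
  \;\; (k < R),
\qquad
P(Y = R \mid X) =
  \frac{1}{1 + \sum_{j=1}^{R-1} \exp\!\big(\theta_{j0} + \sum_{i=1}^{n} \theta_{ji} X_i\big)}
```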
Example: Sentiment Analysis – With Engineered features
Sentiment Features
Note:
The exponential of the regression coefficient (e^{w_cgpa}) is the odds ratio associated with a one-unit increase in CGPA: the odds of being offered a job increase by a factor of 1.35 for every unit increase in CGPA [np.exp(model.params)].
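A hedged sketch of the kind of code that produces this np.exp(model.params) interpretation, using statsmodels; the actual modelling code and data behind the 1.35 figure are not shown in the slides, so the rows below reuse the running CGPA example and the printed odds ratio will differ:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "cgpa": [5.5, 5, 8, 9, 6, 7.5],
    "job":  [1,   0, 1, 1, 0, 0],
})
X = sm.add_constant(df[["cgpa"]])            # intercept + cgpa
model = sm.Logit(df["job"], X).fit(disp=0)   # fit the logistic regression

# exp(coefficient) = odds ratio for a one-unit increase in cgpa
print(np.exp(model.params))
```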
Note: This topic was already covered in Module 3 and the implementation remains the same; only one point is added here regarding the interpretation of logistic regression coefficients.
Ways to Control Overfitting – Interpretation of the Hyperparameter
• Regularization: penalize the weights, e.g. by adding an L2 term λ Σ_{j=1}^{n} θ_j² to the cost (the sum runs over the n weights).
Note:
The hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not
alpha (as in other linear models), but its inverse: C. The higher the value of C, the less the model is
regularized.
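A minimal sketch (not from the slides) showing the effect of C in scikit-learn's LogisticRegression; the dataset is randomly generated for illustration:

```python
# Smaller C = stronger regularization = smaller weights; larger C = weaker regularization.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 6))                               # 6 features, as in the earlier example
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C).fit(X, y)
    print("C = %-6s  sum|weights| = %.3f" % (C, np.abs(clf.coef_).sum()))
# The total weight magnitude grows as C increases, i.e. as the model is regularized less.
```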
Evaluation of Classifiers
Using Another Example
Classifier Evaluation Metrics: Confusion Matrix

Confusion Matrix:
Given m classes, an entry CM_{i,j} in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j. The matrix may have extra rows/columns to provide totals.

                    Predicted: C1              Predicted: ¬C1
  Actual: C1        True Positives (TP)        False Negatives (FN)
  Actual: ¬C1       False Positives (FP)       True Negatives (TN)

• True Positive (TP): the number of predictions where the classifier correctly predicts the positive class as positive.
• True Negative (TN): the number of predictions where the classifier correctly predicts the negative class as negative.
• False Positive (FP): the number of predictions where the classifier incorrectly predicts the negative class as positive.
• False Negative (FN): the number of predictions where the classifier incorrectly predicts the positive class as negative.
Classifier Evaluation Metrics: Confusion Matrix

Classifier accuracy (recognition rate): the percentage of test set tuples that are correctly classified.
  Accuracy = (TP + TN) / All
Accuracy is most effective when the class distribution is relatively balanced.

Classification error (misclassification rate): 1 − accuracy, or
  Error = (FP + FN) / All

                    Predicted: C1              Predicted: ¬C1
  Actual: C1        True Positives (TP)        False Negatives (FN)
  Actual: ¬C1       False Positives (FP)       True Negatives (TN)
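A short sketch (not from the slides) computing these quantities with scikit-learn; the true and predicted labels are invented for illustration:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# For binary labels {0, 1}, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP=%d FP=%d FN=%d TN=%d" % (tp, fp, fn, tn))
print("accuracy:", accuracy_score(y_true, y_pred))        # (TP + TN) / All
print("error   :", 1 - accuracy_score(y_true, y_pred))    # (FP + FN) / All
```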
Model 1:
  CarPrice = 1 / (1 + e^(−8.5 + 0.5·Mileage − 1.5·Mileage²))
Model 2:
  CarPrice = 1 / (1 + e^(5.5 − 1.5·Mileage))

                      PREDICTED: Low       PREDICTED: High
  ACTUAL: Low         a (TP)               b (FN)
  ACTUAL: High        c (FP)               d (TN)
Exercise: given a confusion matrix for medical data where the class values are yes and no for the class label attribute cancer, calculate the accuracy of the classifier.
– Classifier outputs (scores) are used to rank test records, from the most likely positive class record to the least likely positive class record.

  Class (ranked by score P): +     −     +     −     −     −     +     −     +     +
  Threshold >=               0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  TP                         5     4     4     3     3     3     3     2     2     1     0
  FP                         5     5     4     4     3     2     1     1     0     0     0
  TN                         0     0     1     1     2     3     4     4     5     5     5
  FN                         0     1     1     2     2     2     2     3     3     4     5
  TPR                        1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0

(The Class row lists the 10 records' true labels ordered by score; the threshold row includes the extra cutoff 1.00.)

ROC Curve:
• No model consistently outperforms the other
• M1 is better for small FPR
• M2 is better for large FPR
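The TPR/FPR columns of such a table can be generated directly with scikit-learn. A sketch (not from the slides) using the scores and labels implied by the table above:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]                      # + = 1, − = 0
y_score = [0.25, 0.43, 0.53, 0.76, 0.85, 0.85, 0.85, 0.87, 0.93, 0.95]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print("threshold >= %-5s  TPR = %.2f  FPR = %.2f" % (th, t, f))
print("AUC =", roc_auc_score(y_true, y_score))
```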
Class Imbalance Problem
Problems where the classes are skewed (more records from one class than another)
• Oversample the minority class: generate new samples based on the distances between a point and its nearest neighbours, e.g. the Synthetic Minority Oversampling Technique (the SMOTE class in the imbalanced-learn package); a small sketch follows this list.
• Change the performance metric: use Recall, Precision or ROC curves instead of accuracy.
• Try different algorithms: some algorithms, such as Support Vector Machines and tree-based algorithms, may work better with imbalanced classes. We will discuss these post mid-term.
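A hedged sketch (not from the slides) of minority-class oversampling with SMOTE from the imbalanced-learn (imblearn) package; the imbalanced dataset is generated synthetically for illustration:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Roughly 90% / 10% class split
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority examples by interpolating between nearest neighbours
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after :", Counter(y_res))
```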
Many measures exist, but none of them may be ideal in all situations.
Significant factors that help:
• Level of class imbalance
• Importance of TP vs FP
• Cost/Time tradeoffs
• Many measures exist, but none of them may be ideal in all situations
– Random classifiers can have high value for many of these measures
– TPR/FPR provides important information but may not be sufficient by
itself in many practical scenarios
– Given two classifiers, sometimes you can tell that one of them is strictly
better than the other
• C1 is strictly better than C2 if C1 has strictly better TPR and FPR
relative to C2 (or same TPR and better FPR, and vice versa)
– Even if C1 is strictly better than C2, C1’s F-value can be worse than C2’s
if they are evaluated on data sets with different imbalances
– Classifier C1 can be better or worse than C2 depending on the scenario
at hand
Types of Classification Based on the Output Labels
Types of Classification: Output Labels
• Target Concept
Examples of Multiclass:
• Email foldering/tagging: Work, Friends, Family, Hobby
• Medical Diagnostics: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow
Prediction – Multi-class Classification
One-vs-All strategy (one-vs-rest)

[Figure: a three-class dataset in the (x1, x2) plane decomposed into three binary problems, one per class]

Train one binary classifier per class:
  h_θ^(i)(x) = P(y = i | x; θ),  i = 1, 2, 3
  Class 1: h_θ^(1)(x);  Class 2: h_θ^(2)(x);  Class 3: h_θ^(3)(x)

For an input x, predict the class i that maximizes h_θ^(i)(x).

Note: Scikit-Learn detects when you try to use a binary classification algorithm for a multi-class classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO).
[Figure: the same three-class example in the (x1, x2) plane, with per-class classifiers h_θ^(i)(x) = P(y = i | x; θ), i = 1, 2, 3, and prediction by max_i h_θ^(i)(x)]

One-vs-One (OvO): N × (N − 1) / 2 classifiers.
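A brief sketch (not from the slides) of both strategies in scikit-learn; the iris dataset is used only as a convenient 3-class example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)

ova = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print("OvA classifiers:", len(ova.estimators_))   # N = 3, one per class
print("OvO classifiers:", len(ovo.estimators_))   # N(N-1)/2 = 3 for N = 3
print(ova.predict(X[:3]), ovo.predict(X[:3]))
```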
Logistic Regression (Classification) – Summary

• Model
  h_θ(x) = P(Y = 1 | X_1, X_2, ..., X_n) = 1 / (1 + e^(−θ^T x))

• Cost function
  J(θ) = (1/m) Σ_{i=1}^{m} Cost(h_θ(x^(i)), y^(i))
  Cost(h_θ(x), y) = −log h_θ(x)        if y = 1
                  = −log(1 − h_θ(x))   if y = 0

• Learning
  Gradient descent: Repeat { θ_j := θ_j − α (1/m) Σ_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) x_j^(i) }

• Inference
  Y = h_θ(x_test) = 1 / (1 + e^(−θ^T x_test))
Note:
• σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a Logistic model predicts 1 if xTθ is positive, and 0 if it is negative
• logit(p) = log(p / (1 - p)), is the inverse of the logistic function. Indeed, if you compute the logit of the estimated probability
p, you will find that the result is t. The logit is also called the log-odds
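A small sketch (not from the slides) verifying both notes with scikit-learn: the model predicts 1 exactly where the decision function t = θ^T x + θ_0 is positive, and the logit of the predicted probability recovers t. The data is randomly generated:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
t = clf.decision_function(X[:5])        # t = theta^T x + theta0
p = clf.predict_proba(X[:5])[:, 1]      # sigma(t)

print(clf.predict(X[:5]), t > 0)        # predictions are 1 exactly where t > 0
print(np.log(p / (1 - p)))              # logit(p) reproduces t (up to floating point)
print(t)
```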
• Cost(h_θ(x), y) = −log h_θ(x)        if y = 1
                  = −log(1 − h_θ(x))   if y = 0
• If y = 1: Cost(h_θ(x), y) = −log h_θ(x)
• If y = 0: Cost(h_θ(x), y) = −log(1 − h_θ(x))
• The partial derivative of J(θ) is evaluated using the pattern of the derivative of the sigmoid function.
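Spelled out (a standard derivation added for completeness; it matches the gradient descent update used earlier in this deck):

```latex
\sigma'(z) = \sigma(z)\,\big(1 - \sigma(z)\big)
\quad\Longrightarrow\quad
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}
```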
• Tom M. Mitchell
Generative and discriminative classifiers: Naïve Bayes and Logistic Regression
http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
Interpretability
• https://christophm.github.io/interpretable-ml-book/logistic.html