W4 - Logistic Regression
Spring – 2025
Let’s make an email spam filter
- Suppose you want to develop an ML-based model that detects whether a certain email is important or spam.

  Example (not spam):
    From: [email protected]
    Date: February 13, 2023
    Subject: CS-370 Announcement
    Dear students,
    Please note that the first quiz on CS-370 will be conducted on …

  Example (spam):
    From: [email protected]
    Date: February 13, 2023
    Subject: URGENT!!!
    Hello Dear,
    I am a Nigerian prince, and I have a business proposal for you ...

- Input: 𝒙 = email message
- Output: 𝑦 ∈ {spam, not-spam} or 𝑦 ∈ {+1, −1}
- Objective: Learn a predictor 𝑓 such that 𝒙 → Model → 𝑦 ∈ {+1, −1}
- The training dataset is a partial specification of the desired behaviour:
  𝒟_train = [("… CS-370 …", −1), ("… 10 million USD …", +1), ("… PVC pipes at reduced …", +1)]

https://www.youtube.com/watch?v=4o5hSxvN_-s
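As a minimal sketch (the variable name and representation are illustrative, not from the course code), the training set can be stored as a list of (text, label) pairs:

    # Hypothetical representation of the partially-specified training set.
    # Labels follow the slide's convention: +1 = spam, -1 = not spam.
    D_train = [
        ("... CS-370 ...", -1),
        ("... 10 million USD ...", +1),
        ("... PVC pipes at reduced ...", +1),
    ]

    for text, label in D_train:
        print(f"{label:+d}  {text}")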
In machine learning, input features are hand-crafted
- Let the example task be to predict whether a string 𝑥 is a valid email address.
- A feature extractor maps an input such as [email protected] to a set of feature values:
    Length>10       : True  → 1
    fracOfAlphabets : 0.85  → 0.85
    Contains_@      : True  → 1
    endsWith_.com   : True  → 1
    endsWith_.edu   : False → 0
- For an input 𝑥, its feature vector is
  𝝓(𝑥) = [𝜙₁(𝑥), 𝜙₂(𝑥), …, 𝜙_d(𝑥)]
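A minimal sketch of such a hand-crafted feature extractor; the function name is illustrative, but the features mirror the slide:

    def feature_extractor(x: str) -> list:
        """Map a string to the hand-crafted feature vector phi(x) from the slide."""
        frac_alpha = sum(c.isalpha() for c in x) / max(len(x), 1)
        return [
            1.0 if len(x) > 10 else 0.0,          # Length>10
            frac_alpha,                            # fracOfAlphabets
            1.0 if "@" in x else 0.0,              # Contains_@
            1.0 if x.endswith(".com") else 0.0,    # endsWith_.com
            1.0 if x.endswith(".edu") else 0.0,    # endsWith_.edu
        ]

    print(feature_extractor("[email protected]"))   # [1.0, 0.846..., 1.0, 1.0, 0.0]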
A linear classifier calculates scores to predict classes
- Weight vector 𝒘 ∈ ℝ^d, with one weight per feature, e.g.
    Length>10       : -1.2
    fracOfAlphabets : 0.6
    Contains_@      : 3
    endsWith_.com   : 2.2
    endsWith_.edu   : 2.8
- The score is 𝒘 · 𝝓(𝑥), and for a labelled example the margin is
  $\text{margin} = (\boldsymbol{w} \cdot \phi(x))\, y$
- A linear classifier maps the score to a class using an appropriate function:
  $f_{\boldsymbol{w}}(x) = \operatorname{sign}(\boldsymbol{w} \cdot \phi(x)) = \begin{cases} +1, & \text{if } \boldsymbol{w} \cdot \phi(x) > 0 \\ -1, & \text{if } \boldsymbol{w} \cdot \phi(x) < 0 \\ ?, & \text{if } \boldsymbol{w} \cdot \phi(x) = 0 \end{cases}$
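A sketch of scoring and predicting with these weights, reusing the feature_extractor sketched earlier (the helper names are illustrative):

    def predict(w, phi):
        """Linear classifier: the sign of the score w . phi(x)."""
        score = sum(wi * fi for wi, fi in zip(w, phi))
        return +1 if score > 0 else -1 if score < 0 else 0   # 0 marks the undecided boundary case

    w = [-1.2, 0.6, 3.0, 2.2, 2.8]              # the weight vector from the slide
    phi = feature_extractor("[email protected]")      # [1, ~0.85, 1, 1, 0]
    print(predict(w, phi))                       # score ~ -1.2 + 0.51 + 3 + 2.2 = +4.5 -> +1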
Relationship between data and weights can be visualised on a 2D plane
- Suppose we have
  $f_{\boldsymbol{w}}(x) = \operatorname{sign}(\boldsymbol{w} \cdot \phi(x))$, with $\boldsymbol{w} = [2, -1]$ and $\phi(x) \in \{[2, 0], [0, 2], [2, 4]\}$
  [Figure: the three feature vectors plotted in 2D; the boundary 𝒘 · 𝝓(𝑥) = 0 passes through the origin with normal vector 𝒘, the + region on the side 𝒘 points towards and the − region on the other side.]
- In general, a binary classifier 𝑓_𝒘 defines a hyperplane decision boundary with normal vector 𝒘.
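Working out the scores for the three feature vectors with $\boldsymbol{w} = [2, -1]$ (a direct application of the definition above):

    $\boldsymbol{w} \cdot [2, 0] = 2 \cdot 2 + (-1) \cdot 0 = 4 > 0 \;\Rightarrow\; +1$
    $\boldsymbol{w} \cdot [0, 2] = 2 \cdot 0 + (-1) \cdot 2 = -2 < 0 \;\Rightarrow\; -1$
    $\boldsymbol{w} \cdot [2, 4] = 2 \cdot 2 + (-1) \cdot 4 = 0 \;\Rightarrow\;$ on the decision boundary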
Which loss function is suitable for a binary classifier?
- What are the pros and cons of zero-one loss?
  $\text{Loss}_{0\text{-}1}(\boldsymbol{x}, y, \boldsymbol{w}) = \mathbf{1}\bigl[(\boldsymbol{w} \cdot \phi(x))\, y \le 0\bigr]$
  $\nabla \text{Loss}_{0\text{-}1}(\boldsymbol{x}, y, \boldsymbol{w}) = \begin{cases} \infty, & \text{if margin} = 0 \\ 0, & \text{if margin} \ne 0 \end{cases}$
  The gradient is zero almost everywhere, so gradient descent cannot use it to make progress.
- The hinge loss has a non-trivial gradient:
  $\text{Loss}_{\text{hinge}}(\boldsymbol{x}, y, \boldsymbol{w}) = \max\bigl\{1 - (\boldsymbol{w} \cdot \phi(x))\, y,\ 0\bigr\}$
  $\nabla \text{Loss}_{\text{hinge}}(\boldsymbol{x}, y, \boldsymbol{w}) = \begin{cases} 0, & \text{if margin} > 1 \\ -\phi(x)\, y, & \text{if margin} \le 1 \end{cases}$
  [Figure: zero-one loss and hinge loss plotted against the margin (𝒘 · 𝝓(𝑥)) 𝑦.]
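A short sketch of the hinge loss and its subgradient for a single example, following the definitions above (names are illustrative):

    def hinge_loss(w, phi, y):
        """Hinge loss: max(1 - margin, 0) with margin = (w . phi(x)) * y."""
        margin = sum(wi * fi for wi, fi in zip(w, phi)) * y
        return max(1.0 - margin, 0.0)

    def hinge_gradient(w, phi, y):
        """Subgradient w.r.t. w: -phi(x)*y when margin <= 1, otherwise the zero vector."""
        margin = sum(wi * fi for wi, fi in zip(w, phi)) * y
        return [0.0] * len(w) if margin > 1.0 else [-fi * y for fi in phi]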
Which loss function is suitable for a binary classifier (continued)?
- Logistic loss (also called sigmoid or log loss) is the most common loss function for binary classification:
  $\text{Loss}_{\text{logistic}}(\boldsymbol{x}, y, \boldsymbol{w}) = \log\bigl(1 + e^{-\text{margin}}\bigr) = \log\bigl(1 + e^{-(\boldsymbol{w} \cdot \phi(x))\, y}\bigr)$
  [Figure: 0-1 loss, hinge loss, and logistic loss plotted against the margin.]
- It tries to increase the margin even when the margin already exceeds 1.
- Why is it important to take the average loss to update the weights?
- Averaging the loss over the training examples (here, the three emails) updates the weights in a direction that satisfies most training examples.
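A sketch of the logistic loss and the gradient of its average over a small dataset (the data layout, a list of (phi, y) pairs with y ∈ {+1, −1}, is an assumption):

    import math

    def logistic_loss(w, phi, y):
        """log(1 + exp(-margin)) with margin = (w . phi(x)) * y."""
        margin = sum(wi * fi for wi, fi in zip(w, phi)) * y
        return math.log(1.0 + math.exp(-margin))

    def average_logistic_gradient(w, data):
        """Gradient of the mean logistic loss; d/dw log(1 + e^{-m}) = -phi * y * sigmoid(-m)."""
        grad = [0.0] * len(w)
        for phi, y in data:
            margin = sum(wi * fi for wi, fi in zip(w, phi)) * y
            coeff = -y / (1.0 + math.exp(margin))       # = -y * sigmoid(-margin)
            for i, fi in enumerate(phi):
                grad[i] += coeff * fi / len(data)
        return grad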
Classification tasks can have multiple variations
- Multiclass classification: the output 𝑦 is one of several categories rather than just two.
- Ranking: the output 𝑦 is an ordering of items (e.g. search results).
- Structured prediction: the output 𝑦 is a structured object such as a sequence or a tree.
A linear classifier can only fit data that are linearly separable
- Interactive demo: https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
- Video: https://www.youtube.com/watch?v=3liCbRZPrZA
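A small illustration of data that is not linearly separable (the XOR pattern; this example is not from the slides): no weights $w_1, w_2$ and bias $b$ can satisfy all four constraints, so no line separates the classes.

    $(0,0) \mapsto -1:\; b < 0$
    $(1,1) \mapsto -1:\; w_1 + w_2 + b < 0$
    $(1,0) \mapsto +1:\; w_1 + b > 0$
    $(0,1) \mapsto +1:\; w_2 + b > 0$

Adding the last two constraints gives $w_1 + w_2 + 2b > 0$, so $w_1 + w_2 + b > -b > 0$, contradicting the second constraint.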
Logistic Regression is binary classification using the logistic function
  $f_{\boldsymbol{w}}(x) = \sigma(\boldsymbol{w} \cdot \phi(x)) = \dfrac{1}{1 + e^{-\boldsymbol{w} \cdot \phi(x)}}$
- If 𝑦 = 1, we want 𝑓_𝒘(𝑥) ≈ 1, i.e. 𝒘 · 𝝓(𝑥) ≫ 0.
- If 𝑦 = 0, we want 𝑓_𝒘(𝑥) ≈ 0, i.e. 𝒘 · 𝝓(𝑥) ≪ 0.
  [Figure: the sigmoid 𝜎(𝒘 · 𝝓(𝑥)) plotted against 𝒘 · 𝝓(𝑥).]
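A minimal sketch of the logistic regression predictor (illustrative names; the 0.5 threshold is the usual default):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def predict_proba(w, phi):
        """P(y = 1 | x) under logistic regression: sigma(w . phi(x))."""
        return sigmoid(sum(wi * fi for wi, fi in zip(w, phi)))

    def predict_label(w, phi, threshold=0.5):
        return 1 if predict_proba(w, phi) >= threshold else 0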
How to calculate the cost of an example using Negative Log Likelihood?
- If 𝑦 = 1, we want 𝑓_𝒘(𝑥) ≈ 1, so the cost should be small when 𝑓_𝒘(𝑥) is close to 1: 𝐶𝑜𝑠𝑡 = −log 𝑓_𝒘(𝑥).
- If 𝑦 = 0, we want 𝑓_𝒘(𝑥) ≈ 0, so the cost should be small when 𝑓_𝒘(𝑥) is close to 0: 𝐶𝑜𝑠𝑡 = −log(1 − 𝑓_𝒘(𝑥)).
- The two cases combine into the negative log likelihood of a single example:
  $\text{Cost}\bigl(f_{\boldsymbol{w}}(x), y\bigr) = -\,y \log f_{\boldsymbol{w}}(x) - (1 - y)\log\bigl(1 - f_{\boldsymbol{w}}(x)\bigr)$
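A sketch of the per-example negative log likelihood cost, following the piecewise form above (the clipping constant is a numerical safeguard, not from the slides):

    import math

    def nll_cost(p, y):
        """Negative log likelihood of one example; p = f_w(x) = P(y = 1 | x)."""
        eps = 1e-12
        p = min(max(p, eps), 1.0 - eps)   # guard against log(0)
        return -math.log(p) if y == 1 else -math.log(1.0 - p)

    print(nll_cost(0.9, 1))   # small cost: confident and correct
    print(nll_cost(0.9, 0))   # large cost: confident and wrong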
Decision boundary may or may not be linear
- Let
  $f_{\boldsymbol{w}}(x) = g\bigl(\boldsymbol{w} \cdot \boldsymbol{\phi}(x)\bigr) = g\bigl(w_1 \phi_1(x) + w_2 \phi_2(x) + w_3 \phi_1^2(x) + w_4 \phi_2^2(x)\bigr)$
- Set 𝒘 = [0, 0, 1, 1] and 𝑏 = 1. Predicting the positive class whenever the score reaches the threshold 𝑏 gives the boundary 𝜙₁²(𝑥) + 𝜙₂²(𝑥) = 1, a circle: the model is still linear in 𝒘, but the decision boundary in the original input space is not.
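A sketch of this quadratic-feature classifier (a direct translation of the formulas above; reading 𝑏 as a decision threshold is an assumption):

    def quad_features(x1, x2):
        """phi(x) = [phi1, phi2, phi1^2, phi2^2] for a 2-D input."""
        return [x1, x2, x1 * x1, x2 * x2]

    def classify(x1, x2, w=(0.0, 0.0, 1.0, 1.0), b=1.0):
        score = sum(wi * fi for wi, fi in zip(w, quad_features(x1, x2)))
        return +1 if score >= b else -1     # boundary: x1^2 + x2^2 = 1, a circle

    print(classify(0.5, 0.5))   # inside the unit circle  -> -1
    print(classify(1.0, 1.0))   # outside the unit circle -> +1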
Naïve Bayes Classifier is a simple probabilistic classifier
- Estimate the likelihood of each word in the vocabulary {Dear, Friend, Lunch, Money} from its counts in the Normal and Spam training messages:
    p(Dear | Normal)   = 8/17 ≈ 0.47        p(Dear | Spam)   = 2/7 ≈ 0.29
    p(Friend | Normal) = 5/17 ≈ 0.29        p(Friend | Spam) = 1/7 ≈ 0.14
    p(Lunch | Normal)  = 3/17 ≈ 0.18        p(Lunch | Spam)  = 0/7 = 0.0
    p(Money | Normal)  = 1/17 ≈ 0.06        p(Money | Spam)  = 4/7 ≈ 0.57
  [Figure: word-count histograms for Normal and Spam messages over the vocabulary {Dear, Friend, Lunch, Money}.]
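A sketch that reproduces these likelihood tables from the word counts shown on the slide (the helper function is illustrative):

    # Word counts read off the slide's histograms.
    normal_counts = {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1}   # 17 words in Normal messages
    spam_counts   = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}   # 7 words in Spam messages

    def likelihoods(counts):
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    print(likelihoods(normal_counts))   # {'Dear': 0.47..., 'Friend': 0.29..., 'Lunch': 0.18..., 'Money': 0.06...}
    print(likelihoods(spam_counts))     # {'Dear': 0.29..., 'Friend': 0.14..., 'Lunch': 0.0, 'Money': 0.57...}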
Naïve Bayes does not pay attention to the sequence of inputs
- Start with a prior probability:
    P(Normal) = 8/12 ≈ 0.67
    P(Spam)   = 4/12 ≈ 0.33
- Multiply the prior with the likelihood of the message, e.g. for a message containing the words Dear and Friend:
    P(Normal | Dear Friend) ∝ P(Normal) × p(Dear | Normal) × p(Friend | Normal)
    P(Spam | Dear Friend)   ∝ P(Spam) × p(Dear | Spam) × p(Friend | Spam)
  and predict the class with the larger score.
- The equality above is a proportionality: the common normalising term, the probability of the message itself, is dropped.
- Redo the problem with the sentence ‘lunch money money money’.
- Add a constant 𝛼 to each histogram count, so that a word that never appears in one class (such as ‘Lunch’ in Spam) does not force the whole product to zero.
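A sketch that combines the prior, the word likelihoods, and add-𝛼 (Laplace) smoothing, and scores the sentence from the exercise (𝛼 = 1 and the function name are illustrative):

    normal_counts = {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1}
    spam_counts   = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}
    priors = {"Normal": 8 / 12, "Spam": 4 / 12}

    def class_score(message, counts, prior, alpha=1.0):
        """Prior times smoothed word likelihoods; word order is ignored (the 'naive' assumption)."""
        total = sum(counts.values()) + alpha * len(counts)
        score = prior
        for word in message:
            score *= (counts.get(word, 0) + alpha) / total
        return score

    msg = ["Lunch", "Money", "Money", "Money"]
    scores = {"Normal": class_score(msg, normal_counts, priors["Normal"]),
              "Spam":   class_score(msg, spam_counts, priors["Spam"])}
    print(scores, "->", max(scores, key=scores.get))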
What did we cover so far?

                          Classification           Regression
    Predictor function    sign(score)              score
    Relates to correct y  Margin (score × y)       Residual (score − y)
    Loss functions        0–1, Hinge, Logistic     Mean Absolute Error, Mean Squared Error
Summary of today’s lecture
- Next Lecture:
Do you have any questions?
EOP