W4 - Logistic Regression

The document outlines a lecture on Machine Learning focusing on logistic regression and its application in designing a spam filter. It discusses the problem setting, loss functions, and various algorithms such as Naïve Bayes and Nearest Neighbour. Additionally, it covers feature extraction and the scoring mechanism used by linear classifiers to predict classes.


Machine Learning (CS-245)

Spring – 2025

Dr. Mehwish Fatima


Assistant Professor,
AI & DS | SEECS | NUST, Islamabad
Overview of this week’s lecture

Logistic Regression

- Problem Setting: Designing a spam filter

- Loss functions in classification

- Naïve Bayes Algorithm

- Nearest Neighbour Algorithm

Let’s make an email spam filter

- Suppose you want to develop an ML-based model that detects whether a given email is important or spam.

- Input: x = email message
- Output: y ∈ {spam, not-spam} or y ∈ {+1, −1}
- Objective: learn a predictor f such that x → Model → y ∈ {1, 0}

- The training dataset is a partial specification of the desired behaviour:

  D_train = [("… CS-370 …", −1),
             ("… 10 million USD …", +1),
             ("… PVC pipes at reduced …", +1)]

- (Figure on slide: two example emails — one from [email protected], subject "CS-370 Announcement", announcing the first quiz; and one from [email protected], subject "URGENT!!!", offering a business proposal from a Nigerian prince.)

https://www.youtube.com/watch?v=4o5hSxvN_-s
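A minimal sketch in Python of how this setting can be represented (the placeholder predictor f and its toy rule are illustrative, not the method taught here):

```python
# Training data as (text, label) pairs, with +1 = spam and -1 = not spam,
# mirroring D_train on the slide.
D_train = [
    ("... CS-370 ...", -1),
    ("... 10 million USD ...", +1),
    ("... PVC pipes at reduced ...", +1),
]

def f(x: str) -> int:
    """Placeholder predictor to be learned; returns +1 (spam) or -1 (not spam)."""
    return +1 if "USD" in x else -1   # toy rule, only to show the interface

print([(x, f(x)) for x, _ in D_train])
```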
In machine learning, input features are hand-crafted

- Let the example task be to predict whether a string x is a valid email address.

- What properties of x might be relevant for predicting y?

- A feature extractor receives an input x and outputs a set of feature-name and feature-value pairs. For x = abc@gmail.com:

  Length>10       : True  → 1
  fracOfAlphabets : 0.85  → 0.85
  Contains_@      : True  → 1
  endsWith_.com   : True  → 1
  endsWith_.edu   : False → 0

- For an input x, its feature vector is

  φ(x) = [φ1(x), φ2(x), …, φd(x)]

- Think of φ(x) ∈ ℝ^d as a point in a high-dimensional space.
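A minimal sketch of such a feature extractor in Python (the feature names follow the slide; the thresholds are illustrative):

```python
# Sketch of the feature extractor described above.
def extract_features(x: str) -> dict:
    return {
        "Length>10": 1.0 if len(x) > 10 else 0.0,
        "fracOfAlphabets": sum(c.isalpha() for c in x) / len(x),
        "Contains_@": 1.0 if "@" in x else 0.0,
        "endsWith_.com": 1.0 if x.endswith(".com") else 0.0,
        "endsWith_.edu": 1.0 if x.endswith(".edu") else 0.0,
    }

phi = extract_features("abc@gmail.com")
print(phi)  # fracOfAlphabets comes out to 11/13 ≈ 0.85; the rest are 0/1 indicators
```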
A linear classifier calculates scores to predict classes

- The score of a training example (x, y) is the weighted sum of its features. It represents how confident the model is about a prediction:

  score = w · φ(x) = Σ_{i=1..d} w_i φ(x)_i

- The margin of an example (x, y) is the score multiplied by the true label. It represents how correct the model is about a prediction:

  margin = (w · φ(x)) y

- A linear classifier maps the score to the classes using an appropriate function:

  f_w(x) = sign(w · φ(x)) = +1 if w · φ(x) > 0,  −1 if w · φ(x) < 0,  ? if w · φ(x) = 0

- Example from the slide, with feature vector φ(x) ∈ ℝ^d and weight vector w ∈ ℝ^d:

  Feature           φ(x)    w
  Length>10         1      −1.2
  fracOfAlphabets   0.85    0.6
  Contains_@        1       3
  endsWith_.com     1       2.2
  endsWith_.edu     0       2.8
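A short sketch computing score, margin, and prediction with the example weights above (the true label y = +1 is assumed only for illustration):

```python
# Score, margin, and sign prediction for the linear classifier above.
features = {"Length>10": 1.0, "fracOfAlphabets": 0.85, "Contains_@": 1.0,
            "endsWith_.com": 1.0, "endsWith_.edu": 0.0}
weights  = {"Length>10": -1.2, "fracOfAlphabets": 0.6, "Contains_@": 3.0,
            "endsWith_.com": 2.2, "endsWith_.edu": 2.8}

score = sum(weights[k] * v for k, v in features.items())        # w · φ(x)
y_true = +1                                                      # assumed label
margin = score * y_true                                          # (w · φ(x)) y
prediction = +1 if score > 0 else (-1 if score < 0 else None)    # sign(w · φ(x))
print(score, margin, prediction)   # 4.51  4.51  1
```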
Relationship between data and weights can be visualised on a 2D plane

- Let

  f_w(x) = sign(w · φ(x)),  w = [2, −1],  φ(x) ∈ {[2, 0], [0, 2], [2, 4]}

- (Figure on slide: the three points plotted in the plane together with the decision boundary w · φ(x) = 0, whose positive side is marked with +.)

- In general, a binary classifier f_w defines a hyperplane decision boundary with normal vector w.

  - If x ∈ ℝ²: the hyperplane is a line.
  - If x ∈ ℝ³: the hyperplane is a plane.
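A small sketch classifying the three example points with w = [2, −1]:

```python
# Classify the three points from the slide with w = [2, -1].
import numpy as np

w = np.array([2.0, -1.0])
points = [np.array([2.0, 0.0]), np.array([0.0, 2.0]), np.array([2.0, 4.0])]
for phi in points:
    score = w @ phi
    print(phi, score, np.sign(score))
# [2, 0] -> score  4.0 -> +1 side
# [0, 2] -> score -2.0 -> -1 side
# [2, 4] -> score  0.0 -> lies exactly on the decision boundary
```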
Which loss function is suitable for a binary classifier?

- The simplest loss function for binary classification is the Zero-One Loss:

  Loss_{0-1}(x, y, w) = 𝟙[f_w(x) ≠ y] = 𝟙[(w · φ(x)) y ≤ 0]

- What are the pros and cons of zero-one loss? It directly counts mistakes, but its gradient is useless for optimisation: it is zero wherever it is defined, so gradient descent cannot make progress.

  ∇ Loss_{0-1}(x, y, w) = ∞ if margin = 0,  0 if margin ≠ 0

- Hinge Loss is a better alternative to zero-one loss:

  Loss_hinge(x, y, w) = max(1 − (w · φ(x)) y, 0)

- It has a non-trivial gradient:

  ∇ Loss_hinge(x, y, w) = 0 if margin > 1,  −φ(x) y if margin ≤ 1

- (Figure on slide: 0-1 loss and hinge loss plotted as functions of the margin (w · φ(x)) y.)
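A quick sketch of both losses as functions of the margin, matching the plot on the slide:

```python
# Zero-one loss and hinge loss as functions of the margin (w · φ(x)) y.
def zero_one_loss(margin: float) -> float:
    return 1.0 if margin <= 0 else 0.0

def hinge_loss(margin: float) -> float:
    return max(1.0 - margin, 0.0)

for m in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(m, zero_one_loss(m), hinge_loss(m))
# Note: the hinge loss still pushes the margin upward until it exceeds 1.
```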
Which loss function is suitable for a binary classifier (continued)?

- Logistic loss (also called sigmoid loss) is the most common loss function for binary classification:

  Loss_logistic(x, y, w) = log(1 + e^(−margin))

- It tries to increase the margin even when the margin already exceeds 1.

- (Figure on slide: 0-1, hinge, and logistic loss plotted against the margin (w · φ(x)) y.)

- Why is it important to take the average loss to update the weights? Assume we have the squared loss Loss(x, y, w) = (w · φ(x) − y)² on three examples:

  φ(x)      y     Loss(x, y, w)
  [1, 0]     2    (w1 − 2)²
  [1, 0]     4    (w1 − 4)²
  [0, 1]    −1    (w2 + 1)²

- No single w makes all three losses zero (w1 cannot equal both 2 and 4); taking the average over the three examples updates the weights so that most training examples are satisfied.
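A minimal sketch of this averaging argument with the three examples above (the compromise value w1 = 3 is my own illustration of the average-loss minimiser, not from the slide):

```python
# Average squared loss over the three training examples.
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, 4.0, -1.0])

def avg_loss(w):
    return np.mean((X @ w - y) ** 2)

print(avg_loss(np.array([2.0, -1.0])))  # fits examples 1 and 3 exactly: 1.33...
print(avg_loss(np.array([3.0, -1.0])))  # compromise that minimises the average: 0.66...
```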
Classification task can have multiple variations

- Multiclass classification:
  - y is a category. Categorising data into one of C classes.

- Ranking:
  - y is a permutation. Ranking web pages in order of relevance.

- Structured prediction:
  - y is an object built from parts, e.g. language translation.
A linear classifier can only fit data that are linearly separable

- Linear classifiers have a limited capacity to learn complex patterns in the data.

- Can you do multiclass classification using linear classifiers?
  - Yes, you can.

- Naïve Bayes, SVM, and Logistic Regression are examples of linear classifiers.

- What if the data are not linearly separable?
  - Watch this video: https://www.youtube.com/watch?v=3liCbRZPrZA

- Interactive demo: https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Logistic Regression is binary classification using logistic function

- Use a logistic (sigmoid) function for predicting the class:

  f_w(x) = σ(w · φ(x)) = 1 / (1 + e^(−w · φ(x)))

- If y = 1, we want f_w(x) ≈ 1, i.e. w · φ(x) ≫ 0.

- If y = 0, we want f_w(x) ≈ 0, i.e. w · φ(x) ≪ 0.

- The sigmoid normalises the score into (0, 1), giving an illusion of probability (a pseudo-probability).

- (Figure on slide: σ(w · φ(x)) plotted against w · φ(x).)
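A minimal sketch of this predictor, reusing the example weights and features from the earlier slide:

```python
# Logistic-regression predictor f_w(x) = sigmoid(w · φ(x)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, phi_x):
    return sigmoid(np.dot(w, phi_x))

w = np.array([-1.2, 0.6, 3.0, 2.2, 2.8])      # example weights from the earlier slide
phi_x = np.array([1.0, 0.85, 1.0, 1.0, 0.0])  # features of "abc@gmail.com"
p = predict_proba(w, phi_x)                   # ≈ 0.989
print(p, int(p >= 0.5))                       # pseudo-probability and predicted class 1
```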
How to calculate the cost of an example using Negative Log Likelihood?

- Let the cost function for a single example be

  J(w) = −( y log f_w(x) + (1 − y) log(1 − f_w(x)) )

  or, writing ŷ = f_w(x),

  L(θ) = −( y log ŷ + (1 − y) log(1 − ŷ) )

- If y = 1, the cost reduces to −log f_w(x): we want f_w(x) ≈ 1, i.e. w · φ(x) ≫ 0.

- If y = 0, the cost reduces to −log(1 − f_w(x)): we want f_w(x) ≈ 0, i.e. w · φ(x) ≪ 0.
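A short sketch of the per-example negative log likelihood (the eps clipping is an added numerical safeguard, not part of the formula on the slide):

```python
# Negative log likelihood (binary cross-entropy) for one example.
import numpy as np

def nll(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

print(nll(1, 0.99))   # confident and correct -> small cost (~0.01)
print(nll(1, 0.01))   # confident and wrong   -> large cost (~4.6)
print(nll(0, 0.01))   # confident and correct -> small cost
```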
Decision boundary may or may not be linear

- How to find the decision boundary for the first dataset (figure on slide)?
  - Set w = [1, 1] and b = −3, giving a linear boundary along φ1(x) + φ2(x) = 3.

- How to find the decision boundary for the second dataset (figure on slide)?
  - Let

    f_w(x) = g(w1 φ1(x) + w2 φ2(x) + w3 φ1(x)² + w4 φ2(x)²)

  - Set w = [0, 0, 1, 1] and b = 1; with the squared features, the boundary is no longer a straight line.
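A small sketch of the two settings, assuming b is an additive bias and the class is read off the sign of the score (the example inputs are illustrative):

```python
# Linear vs. quadratic decision boundaries from the two settings above.
def score_linear(p1, p2, w=(1.0, 1.0), b=-3.0):
    return w[0] * p1 + w[1] * p2 + b            # zero on the line p1 + p2 = 3

def score_quadratic(p1, p2, w=(0.0, 0.0, 1.0, 1.0), b=1.0):
    # feature map [p1, p2, p1**2, p2**2]; weights and b taken from the slide
    return w[0]*p1 + w[1]*p2 + w[2]*p1**2 + w[3]*p2**2 + b

print(score_linear(1.0, 2.0))     # 0.0 -> exactly on the linear boundary
print(score_quadratic(1.0, 1.0))  # sign tells which side of the quadratic boundary
```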
Naïve Bayes Classifier is a simple probabilistic classifier

- Let’s classify documents into different topics, e.g. Normal vs. Spam emails.

- (Figure on slide: word-count histograms over the vocabulary {Dear, Friend, Lunch, Money} for the Normal and the Spam training emails.)

- From the Normal histogram (17 words in total):

  p(Dear | Normal)   = 8/17 = 0.47
  p(Friend | Normal) = 5/17 = 0.29
  p(Lunch | Normal)  = 3/17 = 0.18
  p(Money | Normal)  = 1/17 = 0.06

- From the Spam histogram (7 words in total):

  p(Dear | Spam)   = 2/7 = 0.29
  p(Friend | Spam) = 1/7 = 0.14
  p(Lunch | Spam)  = 0/7 = 0.0
  p(Money | Spam)  = 4/7 = 0.57

- The probabilities calculated above are called likelihoods.
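A small sketch estimating these per-class word likelihoods from the slide's counts:

```python
# Per-class word likelihoods from the word-count histograms on the slide.
normal_counts = {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1}   # 17 words total
spam_counts   = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}   #  7 words total

def likelihoods(counts):
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

print(likelihoods(normal_counts))  # ≈ {'Dear': 0.47, 'Friend': 0.29, 'Lunch': 0.18, 'Money': 0.06}
print(likelihoods(spam_counts))    # ≈ {'Dear': 0.29, 'Friend': 0.14, 'Lunch': 0.0,  'Money': 0.57}
```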
Naïve Bayes does not pay attention to the sequence of inputs

- Given the sentence ‘Dear Friend’, does it belong to Normal or Spam?

- Start with the prior probability (8 Normal and 4 Spam emails out of 12):

  P(Normal) = 8/12 = 0.67
  P(Spam)   = 4/12 = 0.33

- Multiply the prior with the likelihood of the message:

  p(Normal | dear friend) ∝ p(Normal) × p(dear | Normal) × p(friend | Normal)
  p(Spam | dear friend)   ∝ p(Spam) × p(dear | Spam) × p(friend | Spam)

- The equality above is a proportionality: the evidence p(dear friend) is the same for both classes and is dropped.

- Redo the problem with the sentence ‘lunch money money money’. Because p(Lunch | Spam) = 0, the Spam product collapses to zero; add a constant α to each histogram count (Laplace smoothing) so that no likelihood is exactly zero. A sketch of both computations follows below.
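A minimal sketch of the Naïve Bayes scoring above, including the α-smoothing exercise (the specific function names are illustrative):

```python
# Naive Bayes scoring of 'Dear Friend' with the slide's counts, plus Laplace
# smoothing (alpha) so 'lunch money money money' does not get a hard zero
# from p(Lunch | Spam) = 0.
counts = {
    "Normal": {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1},
    "Spam":   {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4},
}
priors = {"Normal": 8 / 12, "Spam": 4 / 12}

def nb_score(words, cls, alpha=0.0):
    cls_counts = counts[cls]
    total = sum(cls_counts.values()) + alpha * len(cls_counts)
    score = priors[cls]
    for w in words:
        score *= (cls_counts[w] + alpha) / total   # word order is ignored
    return score

for msg in (["Dear", "Friend"], ["Lunch", "Money", "Money", "Money"]):
    for alpha in (0.0, 1.0):
        scores = {c: nb_score(msg, c, alpha) for c in counts}
        print(msg, "alpha =", alpha, "->", max(scores, key=scores.get), scores)
# 'Dear Friend' scores higher under Normal; with alpha = 1 the second message
# is no longer forced to zero under Spam and comes out as Spam.
```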
What did we cover so far?

                           Classification            Regression
  Predictor function       sign(score)               score
  Relates to correct y     Margin (score × y)        Residual (score − y)
  Loss functions           0-1, Hinge, Logistic      Mean Absolute Error, Mean Squared Error
  Optimiser                SGD                       SGD
Summary of today’s lecture

- We discussed linear classification, starting from binary classification and moving to multi-class classification.

- Learnt various loss functions.

- How to fit non-linear data with linear functions.

- Established the similarities and differences with linear regression.

- Discussed Naïve Bayes and Nearest Neighbour algorithms.

- Next Lecture:
  - Decision Trees and Random Forest
Do you have any questions?

Some material (images, tables, text etc.) in this presentation has been borrowed from different books, lecture notes, and the web. The original contents solely belong to their owners and are used in this presentation only for clarifying various educational concepts. Any copyright infringement is not at all intended.