Lec10 Winter2024 Annotated

Revision

Dr. Mennatullah Siam

SOFE 4620 Machine Learning and Data Mining
Winter 2024
Instructor: Dr. Mennatullah Siam
Electrical, Computer and Software Engineering Department
© by Dr. Mennatullah Siam
08/01/2024
Topics we Covered so far
• Linear Regression [Closed Form]
• Regularization - Ridge Regression [Closed Form]
• Logistic Regression
• Gradient Descent [Linear Regression – Logistic Regression]
• Parameter Estimation [MAP vs. MLE, above within MLE framework]
• Naïve Bayes
• SVM

SOFE 4620U – Machine Learning and Data Mining


© by Dr. Mennatullah Siam
Winter 2024
Supervised vs. Unsupervised Learning

• What is the main difference between supervised and unsupervised learning methods?

• Can you give examples of each?

Classification vs. Regression

• What is the main difference between classification and regression?

• Can you give examples of approaches that we can use in these two settings?

Phases of building ML Model

Training, Validation Sets: $\{(x_n, t_n)\}_{n=1}^{N}$, $x_n \in \mathbb{R}^d$
→ Hyper-param Tuning → Learning Algorithm
→ Best Hyper-params → Optimal Parameters → Hypothesis $h(x)$

Phases of building ML Model
What's the difference between hyper-parameters and parameters of your model?

Training, Validation Sets: $\{(x_n, t_n)\}_{n=1}^{N}$, $x_n \in \mathbb{R}^d$
→ Hyper-param Tuning → Learning Algorithm
→ Best Hyper-params → Optimal Parameters → Hypothesis $h(x)$

Phases of building ML Model

Training Set: $\{(x_n, t_n)\}_{n=1}^{N}$, $x_n \in \mathbb{R}^d$ → Learning Algorithm → Optimal Parameters, Hypothesis $h(x)$

Testing Set: $\{(x_n, t_n)\}_{n=1}^{N}$ → Evaluation → Evaluation Metrics

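The train/validation/test flow above can be sketched in NumPy. This is an illustrative toy example (the data and the choice of polynomial degree as the hyper-parameter are invented here, not from the slides): parameters are learned on the training set, the hyper-parameter is tuned on the validation set, and the test set is touched only once at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a quadratic: t = x^2 + noise.
x = rng.uniform(-1, 1, 60)
t = x**2 + 0.05 * rng.normal(size=60)

# Split into training / validation / test sets.
x_tr, t_tr = x[:30], t[:30]
x_va, t_va = x[30:45], t[30:45]
x_te, t_te = x[45:], t[45:]

# Hyper-parameter tuning: pick the polynomial degree with the lowest
# validation error; the polynomial coefficients are the parameters.
def val_error(deg):
    w = np.polyfit(x_tr, t_tr, deg)   # learn parameters on the training set
    return np.mean((np.polyval(w, x_va) - t_va) ** 2)

best_deg = min(range(1, 8), key=val_error)

# Final evaluation on the held-out test set only.
w = np.polyfit(x_tr, t_tr, best_deg)
test_mse = np.mean((np.polyval(w, x_te) - t_te) ** 2)
print(best_deg, test_mse)
```

Reporting the validation error as the final result would be optimistic, since the hyper-parameter was chosen to minimize it; that is why the test set stays untouched until the end.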
Linear Regression
Similar cost functions, just different ways of writing it.

Training

$$C(\boldsymbol{\omega}) = \frac{1}{2N} \sum_{n=1}^{N} (x_n^T \omega - t_n)^2$$

$$C(\boldsymbol{\omega}) = \frac{1}{2} \| \mathbf{Y} - \mathbf{X}\boldsymbol{\omega} \|_2^2$$

$$C(\boldsymbol{\omega}) = \frac{1}{2} (\mathbf{Y} - \mathbf{X}\boldsymbol{\omega})^T (\mathbf{Y} - \mathbf{X}\boldsymbol{\omega})$$
Linear Regression

Training

$$C(\boldsymbol{\omega}) = \frac{1}{2N} \sum_{n=1}^{N} (x_n^T \omega - t_n)^2$$

What will change if I want to add bias to the regression?

Linear Regression

Training

$$C(\boldsymbol{\omega}) = \frac{1}{2N} \sum_{n=1}^{N} (x_n^T \omega - t_n)^2$$

What will change in polynomial regression?

Linear Regression
Closed Form.

Training

$$\boldsymbol{\omega} = (X^T X)^{-1} X^T Y$$

Linear Regression

Predict

$$\hat{Y} = X \boldsymbol{\omega}$$

(dimensions: $N \times 1 = (N \times d)(d \times 1)$)

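The closed-form fit and the prediction step can be sketched together in a few lines of NumPy. The toy data here is invented for illustration; `np.linalg.solve` is used instead of an explicit matrix inverse, which is the standard way to solve the normal equations.

```python
import numpy as np

# Toy data: N=5 points on the line t = 2x (one feature, no bias column).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])   # N x d
t = np.array([0.0, 2.0, 4.0, 6.0, 8.0])             # N

# Closed form: w = (X^T X)^{-1} X^T Y, solved as a linear system.
w = np.linalg.solve(X.T @ X, X.T @ t)

# Predict: Y_hat = X w  (N x d times d x 1 gives N x 1).
t_hat = X @ w
print(w)       # -> [2.]
print(t_hat)   # -> [0. 2. 4. 6. 8.]
```

Adding a bias amounts to appending a column of ones to $X$, which is one answer to the question on the earlier slide.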
Linear Regression

Evaluation

$$\mathbf{rmse} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (x_n^T \omega - t_n)^2}$$

$$\mathbf{mae} = \frac{1}{N} \sum_{n=1}^{N} \left| x_n^T \omega - t_n \right|$$

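Both metrics are one-liners in NumPy; a minimal sketch with invented toy values:

```python
import numpy as np

def rmse(X, w, t):
    # Root of the mean squared error over the N examples.
    return np.sqrt(np.mean((X @ w - t) ** 2))

def mae(X, w, t):
    # Mean absolute error over the N examples.
    return np.mean(np.abs(X @ w - t))

X = np.array([[1.0], [2.0], [3.0]])
t = np.array([1.0, 2.0, 4.0])
w = np.array([1.0])            # predicts [1, 2, 3]
print(rmse(X, w, t))           # errors (0, 0, 1) -> sqrt(1/3) ≈ 0.577
print(mae(X, w, t))            # -> 1/3 ≈ 0.333
```

RMSE penalizes large errors more heavily than MAE because of the squaring, which is why the two can rank models differently.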
Regularization
(Figure: three polynomial fits with M = 9 — "no regularization", "a good amount of regularization", "too much regularization".)

No regularization will result in ...? While too much regularization will result in ...?

How to determine best regularization factor?


Ridge Regression

Training

$$C(\boldsymbol{\omega}) = \frac{1}{2N} \sum_{n=1}^{N} (x_n^T \omega - t_n)^2 + \lambda \sum_{d=1}^{D} \omega_d^2$$

$$C(\boldsymbol{\omega}) = \| \mathbf{Y} - \mathbf{X}\boldsymbol{\omega} \|_2^2 + \lambda \| \boldsymbol{\omega} \|_2^2$$

$$C(\boldsymbol{\omega}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\omega})^T (\mathbf{Y} - \mathbf{X}\boldsymbol{\omega}) + \lambda \boldsymbol{\omega}^T \boldsymbol{\omega}$$

Ridge Regression
Closed Form.

Training

$$\boldsymbol{\omega} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{Y}$$

Does the prediction change for Ridge Regression vs. Linear Regression?

How to evaluate Ridge Regression?

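The ridge closed form differs from plain least squares only by the $\lambda I$ term. A minimal sketch with invented toy data, showing how $\lambda$ shrinks the weights:

```python
import numpy as np

def ridge_fit(X, t, lam):
    # Closed form: w = (X^T X + lambda I)^{-1} X^T Y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
t = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
print(ridge_fit(X, t, 0.0))    # lambda = 0 recovers plain least squares: [2.]
print(ridge_fit(X, t, 10.0))   # shrinks the weight toward zero: 60/40 = [1.5]
```

Prediction and evaluation are unchanged from linear regression ($\hat{Y} = X\omega$, rmse/mae); only the training step differs.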
L2 vs. L1 Regularization
L2 Regularization

$$C(\boldsymbol{\omega}) = \text{Original Cost Fn.} + \lambda \| \boldsymbol{\omega} \|_2^2$$

L1 Regularization

$$C(\boldsymbol{\omega}) = \text{Original Cost Fn.} + \lambda \| \boldsymbol{\omega} \|_1$$

L2 vs. L1 Regularization
L2 Regularization

$$C(\boldsymbol{\omega}) = \text{Original Cost Fn.} + \lambda \| \boldsymbol{\omega} \|_2^2$$

L1 Regularization

$$C(\boldsymbol{\omega}) = \text{Original Cost Fn.} + \lambda \| \boldsymbol{\omega} \|_1$$

What is the main difference between L1 and L2 regularization?

Can we use L2/L1 regularization on other methods in classification, e.g., Neural Networks?
L2 vs. L1 Regularization
L2 Regularization: pushes the weights of all features to be small.

$$C(\boldsymbol{\omega}) = \text{Original Cost Fn.} + \lambda \| \boldsymbol{\omega} \|_2^2$$

L1 Regularization: pushes the weights of unimportant features to zero, favors sparse solutions.

$$C(\boldsymbol{\omega}) = \text{Original Cost Fn.} + \lambda \| \boldsymbol{\omega} \|_1$$

What is the main difference between L1 and L2 regularization?

Can we use L2/L1 regularization on other methods in classification, e.g., Neural Networks?
Bias – Variance Tradeoff
$\theta$: True parameter
$\hat{\theta}$: Estimated parameter

$$\mathbf{Bias}(\hat{\theta}, \theta) = E[\hat{\theta}] - \theta$$

$$\mathbf{Var}(\hat{\theta}) = E\left[ (\hat{\theta} - E[\hat{\theta}])^2 \right]$$

Bias – Variance Tradeoff
$\theta$: True parameter
$\hat{\theta}$: Estimated parameter

Higher sensitivity to changes in the data, a.k.a. high variance.

$$\mathbf{Bias}(\hat{\theta}, \theta) = E[\hat{\theta}] - \theta$$

$$\mathbf{Var}(\hat{\theta}) = E\left[ (\hat{\theta} - E[\hat{\theta}])^2 \right]$$

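The two definitions can be checked empirically. An illustrative simulation (invented here, not from the slides): repeat an experiment many times, collect the estimates $\hat{\theta}$, and plug them into the formulas above. Here the estimator is the sample mean of 20 Gaussian samples.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 5.0    # true parameter

# Each trial draws N=20 samples and produces one estimate theta_hat.
estimates = np.array([rng.normal(theta, 1.0, 20).mean() for _ in range(5000)])

bias = estimates.mean() - theta                    # E[theta_hat] - theta
var = np.mean((estimates - estimates.mean())**2)   # E[(theta_hat - E[theta_hat])^2]
print(bias, var)   # bias near 0; variance near 1/20 = 0.05
```

The sample mean is unbiased, and its variance shrinks as the per-trial sample size grows, which is the "sensitivity to changes in the data" the slide annotates.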
Topics we Covered so far
• Linear Regression [Closed Form]
• Regularization - Ridge Regression [Closed Form]
• Logistic Regression
• Gradient Descent [Linear Regression – Logistic Regression]
• Parameter Estimation [MAP vs. MLE, above within MLE framework]
• Naïve Bayes
• SVM

Linear Classification:
Logistic Regression

Training

$$C(\boldsymbol{\omega}) = \frac{1}{N} \sum_{n=1}^{N} \left[ -t_n \log \sigma(x_n^T \omega) - (1 - t_n) \log\left(1 - \sigma(x_n^T \omega)\right) \right]$$

$$\sigma(x_n^T \omega) = \frac{1}{1 + e^{-x_n^T \omega}}$$

Why use the sigmoid function in logistic regression? Why not use it in linear regression?

Linear Classification:
Logistic Regression

Training

$$C(\boldsymbol{\omega}) = \frac{1}{N} \sum_{n=1}^{N} \left[ -t_n \log \sigma(x_n^T \omega) - (1 - t_n) \log\left(1 - \sigma(x_n^T \omega)\right) \right]$$

$$\sigma(x_n^T \omega) = \frac{1}{1 + e^{-x_n^T \omega}}$$

What does the output from sigma represent?

Logistic Regression

Predict

$$\sigma(x_n^T \omega) = \frac{1}{1 + e^{-x_n^T \omega}}$$

$$\sigma(x_n^T \omega) > 0.5 \Rightarrow \hat{t}_n = 1$$
$$\sigma(x_n^T \omega) < 0.5 \Rightarrow \hat{t}_n = 0$$

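The prediction rule above is a sigmoid followed by a 0.5 threshold. A minimal sketch with invented weights and inputs:

```python
import numpy as np

def sigmoid(z):
    # sigma(x^T w) maps a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w):
    # Threshold the probability at 0.5 to get class 0/1 labels.
    return (sigmoid(X @ w) > 0.5).astype(int)

w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.5],    # score  1.5 -> sigma > 0.5 -> class 1
              [0.5, 2.0]])   # score -1.5 -> sigma < 0.5 -> class 0
print(predict(X, w))         # -> [1 0]
```

Since the sigmoid is monotonic, thresholding $\sigma(x^T\omega)$ at 0.5 is equivalent to thresholding the raw score $x^T\omega$ at 0.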
Logistic Regression

Predict

$$x_n^T \omega \quad \text{vs.} \quad x_n^T \omega + b$$

How can these be the same when predicting? What to modify in $x$?

Logistic Regression

Evaluation

                    t_n = Positive    t_n = Negative
t̂_n = Positive          TP                FP
t̂_n = Negative          FN                TN

$$\mathbf{acc} = \frac{TP + TN}{TP + FP + FN + TN}$$

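The confusion-matrix counts and the accuracy formula can be computed directly from predicted and true labels; a minimal sketch with invented labels:

```python
import numpy as np

def accuracy(t_hat, t):
    # acc = (TP + TN) / (TP + FP + FN + TN), i.e. the fraction correct.
    tp = np.sum((t_hat == 1) & (t == 1))
    tn = np.sum((t_hat == 0) & (t == 0))
    fp = np.sum((t_hat == 1) & (t == 0))
    fn = np.sum((t_hat == 0) & (t == 1))
    return (tp + tn) / (tp + fp + fn + tn)

t     = np.array([1, 1, 0, 0, 0])
t_hat = np.array([1, 0, 0, 0, 1])   # one false negative, one false positive
print(accuracy(t_hat, t))           # -> 3/5 = 0.6
```

On heavily imbalanced data accuracy is misleading (always predicting the majority class scores high), which is exactly the issue raised in the fraud-detection question later in these slides.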
Logistic Regression
(Figure: a linear decision boundary separating region 1 from region 2.)

Why is region 1 called the positive half space and region 2 called the negative half space?

What happens if we take the dot product between features and weights for data points on the decision boundary? What's the output?

Logistic Regression
Why is region 1 called the positive half space and region 2 called the negative half space?

Region 1: $x_n^T \omega > 0$, so $\sigma(x_n^T \omega) > 0.5$
On the boundary: $x_n^T \omega = 0$, so $\sigma(x_n^T \omega) = 0.5$
Region 2: $x_n^T \omega < 0$, so $\sigma(x_n^T \omega) < 0.5$

Gradient Descent

What is the main difference between SGD and Batch Gradient Descent?

Why use the gradient direction?

How to set learning rate?


Gradient Descent – Linear Regression

Gradient Descent – Logistic Regression

$$C(\boldsymbol{\omega}) = \frac{1}{N} \sum_{n=1}^{N} \left[ -t_n \log \sigma(x_n^T \omega) - (1 - t_n) \log\left(1 - \sigma(x_n^T \omega)\right) \right]$$

$$\frac{d}{d\omega} C(\omega) = \frac{1}{N} \sum_{n=1}^{N} x_n \left( \sigma(x_n^T \omega) - t_n \right)$$

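The gradient above plugs straight into a batch gradient-descent loop. A minimal sketch with invented toy data and a hand-picked learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(X, t, w):
    # dC/dw = (1/N) * sum_n x_n (sigma(x_n^T w) - t_n), vectorized.
    N = X.shape[0]
    return X.T @ (sigmoid(X @ w) - t) / N

# Linearly separable toy data (one feature): negative x -> 0, positive x -> 1.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(1)
lr = 0.5                       # learning rate (step size)
for _ in range(500):
    w -= lr * grad(X, t, w)    # step opposite to the gradient direction

print(sigmoid(X @ w))          # probabilities approach 0, 0, 1, 1
```

SGD would replace the full-batch gradient with the gradient of one (or a few) randomly chosen examples per step, trading noisier updates for cheaper iterations.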
Topics we Covered so far
• Linear Regression [Closed Form]
• Regularization - Ridge Regression [Closed Form]
• Logistic Regression
• Gradient Descent [Linear Regression – Logistic Regression]
• Parameter Estimation [MAP vs. MLE, above within MLE framework]
• Naïve Bayes
• SVM

Parameter Estimation
• What is the main difference between MLE and MAP?
• What will happen if we have large scale data?
• If we put logistic and linear regression within the MLE framework, what is the underlying distribution assumed of p(y|x)?

$$f_{MAP} = \arg\max_{f \in \mathcal{F}} \; p(f \mid D)$$

$$f_{MLE} = \arg\max_{f \in \mathcal{F}} \; p(D \mid f)$$

$$p(f \mid D) \propto p(D \mid f) \, p(f)$$

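An illustrative worked example (invented here, not from the slides) makes the MLE/MAP contrast concrete: estimating a coin's heads probability $\theta$ from D = 7 heads in 10 flips, with a Beta prior playing the role of $p(f)$.

```python
# Data: 7 heads out of 10 flips.
heads, flips = 7, 10

# MLE: argmax p(D | theta) for a Bernoulli likelihood -> heads / flips.
theta_mle = heads / flips

# MAP with a Beta(a, b) prior on theta (here a = b = 5, favoring 0.5):
# posterior mode = (heads + a - 1) / (flips + a + b - 2).
a, b = 5, 5
theta_map = (heads + a - 1) / (flips + a + b - 2)

print(theta_mle)   # -> 0.7
print(theta_map)   # -> 11/18 ≈ 0.611, pulled toward the prior's 0.5
```

This also answers the large-scale-data question: as `flips` grows, the prior's pseudo-counts become negligible and the MAP estimate converges to the MLE.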
Topics we Covered so far
• Linear Regression [Closed Form]
• Regularization - Ridge Regression [Closed Form]
• Logistic Regression
• Gradient Descent [Linear Regression – Logistic Regression]
• Parameter Estimation [MAP vs. MLE, above within MLE framework]
• Naïve Bayes
• SVM

Naïve Bayes
Training

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

$$p(x \mid y) = \prod_{i=1}^{d} p(x_i \mid y)$$

What is the major assumption in naïve bayes?

What is computed during training?


Naïve Bayes
Training

$$p(S \mid x) = \frac{p(x \mid S)\, p(S)}{p(x \mid S)\, p(S) + p(x \mid E)\, p(E)}$$

$$p(x \mid S) = p(x_1 \mid S)\, p(x_2 \mid S)\, p(x_3 \mid S)\, p(x_4 \mid S)\, p(x_5 \mid S)$$

What is the major assumption in naïve bayes? What is computed during training? Remember the Scottish vs. English classifier?

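A minimal sketch in the spirit of the Scottish vs. English classifier, with a hypothetical word-presence dataset invented here (three binary word features rather than the lecture's five). Training is just counting: the class priors and the per-word conditionals.

```python
import numpy as np

# Rows are documents, columns are 3 binary word features.
X_s = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 1]])   # Scottish docs
X_e = np.array([[0, 1, 1], [0, 0, 1]])              # English docs

# Training: priors p(S), p(E) and per-word conditionals p(x_i | class).
p_s, p_e = 3 / 5, 2 / 5
pw_s = X_s.mean(axis=0)        # p(x_i = 1 | S) for each word
pw_e = X_e.mean(axis=0)        # p(x_i = 1 | E)

def likelihood(x, pw):
    # Naive assumption: p(x | class) = prod_i p(x_i | class).
    return np.prod(np.where(x == 1, pw, 1 - pw))

x = np.array([1, 1, 0])        # a new document
num = likelihood(x, pw_s) * p_s
den = num + likelihood(x, pw_e) * p_e
print(num / den)               # p(S | x); classify Scottish if >= 0.5
```

In practice the counts are smoothed (e.g., Laplace smoothing) so that an unseen word does not force a zero probability for a whole class.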
Naïve Bayes
Predict

$$p(S \mid x) \geq 0.5 \Rightarrow \text{Scottish}$$

$$p(S \mid x) < 0.5 \Rightarrow \text{Other class (English)}$$

Naïve Bayes
Evaluation

How can I evaluate the Naïve Bayes classifier?

Support Vector Machines
Define the planes as:

$$\boldsymbol{\omega}^T x + b = +1$$
$$\boldsymbol{\omega}^T x + b = 0$$
$$\boldsymbol{\omega}^T x + b = -1$$

(Figure: the three parallel planes in the $x^1$–$x^2$ feature space.)

What are the support vectors?
Support Vector Machines
Training

$$J = \frac{1}{2} \| \boldsymbol{\omega} \|^2 - \sum_i \alpha_i \left( y^{(i)} \left( \boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b \right) - 1 \right)$$

What do the alphas represent?

Support Vector Machines
Training

$$J = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y^{(i)} y^{(j)} \left( \boldsymbol{x}^{(i)T} \boldsymbol{x}^{(j)} \right)$$

Why is this called the dual form?

How to transform this to work with non-linearly separable data?

How to regularize SVM to avoid sensitivity to outliers?

I will never ask you to derive any of the SVM equations, so don't worry about that part. But you need to understand what's an alpha or a kernel and so on.
Support Vector Machines
Predict

$$\boldsymbol{\omega} = \sum_i \alpha_i y^{(i)} \boldsymbol{x}^{(i)}$$

$$\boldsymbol{\omega}^T \boldsymbol{x} + b \geq 0 \quad \text{for positive instances}$$
$$\boldsymbol{\omega}^T \boldsymbol{x} + b < 0 \quad \text{for negative instances}$$

So, $\sum_i \alpha_i y^{(i)} \left( \boldsymbol{x}^{(i)T} \boldsymbol{x} \right) + b \geq 0$ for positive instances.

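The two equivalent prediction forms above can be sketched in NumPy. The support vectors, labels, alphas, and bias here are hypothetical values invented for illustration (in a real SVM they come out of the dual optimization):

```python
import numpy as np

# Hypothetical support vectors with labels y in {-1, +1} and learned alphas;
# only points with alpha_i > 0 (the support vectors) enter the sum.
X_sv   = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_sv   = np.array([1.0, -1.0])
alphas = np.array([0.5, 0.5])
b = 0.0

# w = sum_i alpha_i y^(i) x^(i)
w = (alphas * y_sv) @ X_sv     # -> [1., 1.]

def predict(x):
    # Equivalent forms: w^T x + b, or sum_i alpha_i y^(i) (x^(i)T x) + b.
    score = np.sum(alphas * y_sv * (X_sv @ x)) + b
    return 1 if score >= 0 else -1

print(predict(np.array([2.0, 0.5])))    # -> 1  (positive half space)
print(predict(np.array([-1.0, -0.5])))  # -> -1 (negative half space)
```

The second form matters because it only ever touches inputs through inner products, which is what makes the kernel trick on the next slide possible.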
Support Vector Machines
Evaluation

How can I evaluate an SVM model?

Kernel SVM

• Dual form of Kernel SVM Training

$$J = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y^{(i)} y^{(j)} \left( \boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x}^{(j)}) \right)$$

$$K(\boldsymbol{x}_i, \boldsymbol{x}_j) = \boldsymbol{\phi}(\boldsymbol{x}_i)^T \boldsymbol{\phi}(\boldsymbol{x}_j)$$

Polynomial kernel: $K(\boldsymbol{x}_i, \boldsymbol{x}_j) = (\boldsymbol{x}_i \cdot \boldsymbol{x}_j + 1)^d$

RBF kernel: $K(\boldsymbol{x}_i, \boldsymbol{x}_j) = \exp\!\left( -\dfrac{\| \boldsymbol{x}_i - \boldsymbol{x}_j \|^2}{2\sigma^2} \right)$
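The two kernels can be sketched directly from their formulas; the inputs and the hyper-parameter defaults (`d=2`, `sigma=1.0`) are invented here for illustration:

```python
import numpy as np

def poly_kernel(xi, xj, d=2):
    # Polynomial kernel: K(xi, xj) = (xi . xj + 1)^d.
    return (xi @ xj + 1) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    # RBF (Gaussian) kernel: K = exp(-||xi - xj||^2 / (2 sigma^2)).
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

xi = np.array([1.0, 0.0])
xj = np.array([0.0, 1.0])
print(poly_kernel(xi, xj))     # (0 + 1)^2 = 1.0
print(rbf_kernel(xi, xj))      # exp(-2/2) = exp(-1) ≈ 0.368
```

Either function can replace the inner product $\boldsymbol{x}^{(i)T}\boldsymbol{x}^{(j)}$ in the dual objective, giving a non-linear decision boundary without ever computing $\boldsymbol{\phi}$ explicitly.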
Additional Questions
State whether this setup is problematic or not:

• A project team reports a low training error and claims their method is good.

• A project team claimed great success after achieving 98 percent classification accuracy on a binary classification task where one class is very rare (e.g., detecting fraud transactions). Their data consisted of 50 positive examples and 5,000 negative examples.

• A project team split their data into training, validation and test. Using their training data and validation data, they chose the best parameter setting. They built a model using these parameters and their training data, and then reported their error on test data.
Additional Questions
• A project team reports a low training error and claims their method is good. → Problematic

• A project team claimed great success after achieving 98 percent classification accuracy on a binary classification task where one class is very rare (e.g., detecting fraud transactions). Their data consisted of 50 positive examples and 5,000 negative examples. → Problematic

• A project team split their data into training, validation and test. Using their training data and validation data, they chose the best parameter setting. They built a model using these parameters and their training data, and then reported their error on test data. → OK

Wish you all the best in Midterm

Remember why you are doing this course!!

Not for grades, hopefully.

You can make a career out of ML, so focus more on learning.
