Lec10 Winter2024 Annotated
[Figure: training pipeline. The training set $\{(x_n, t_n)\}_{n=1}^{N}$, with $x_n \in \mathbb{R}^d$, feeds the learning algorithm; the validation set drives hyper-parameter tuning; the resulting model is then passed to evaluation.]
Training

$C(\boldsymbol{\omega}) = \frac{1}{2N} \sum_{n=1}^{N} (x_n^T \omega - t_n)^2$

$C(\boldsymbol{\omega}) = \frac{1}{2} \| \boldsymbol{Y} - \boldsymbol{X\omega} \|_2^2$

$C(\boldsymbol{\omega}) = \frac{1}{2} (\boldsymbol{Y} - \boldsymbol{X\omega})^T (\boldsymbol{Y} - \boldsymbol{X\omega})$
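A minimal numpy check that the summation and matrix forms of the cost agree on toy data (shapes assumed from the slide: X is N×d, Y is N×1, w is d×1; note the slide's summation form carries an extra 1/N averaging factor, so both forms are written here with the same 1/2 factor):

```python
import numpy as np

# Toy data; shapes follow the slide: X is N x d, Y is N x 1, w is d x 1.
rng = np.random.default_rng(0)
N, d = 5, 3
X = rng.normal(size=(N, d))
Y = rng.normal(size=(N, 1))
w = rng.normal(size=(d, 1))

# Summation form: (1/2) * sum_n (x_n^T w - t_n)^2
cost_sum = 0.5 * sum(((X[n] @ w - Y[n]) ** 2).item() for n in range(N))

# Matrix form: (1/2) * (Y - Xw)^T (Y - Xw)
r = Y - X @ w
cost_mat = 0.5 * (r.T @ r).item()

print(cost_sum, cost_mat)  # identical up to rounding
```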
SOFE 4620U – Machine Learning and Data Mining
© by Dr. Mennatullah Siam
Winter 2024
Linear Regression
Training

$\boldsymbol{\omega} = (X^T X)^{-1} X^T Y$

Predict

$Y = X\boldsymbol{\omega}$
$(N \times 1) = (N \times d)(d \times 1)$
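A sketch of the closed-form (normal equation) training step and the prediction step in numpy; the data and true weights are synthetic, made up purely for illustration:

```python
import numpy as np

# Synthetic data with known weights so we can sanity-check the fit.
rng = np.random.default_rng(1)
N, d = 100, 3
X = rng.normal(size=(N, d))
true_w = np.array([[2.0], [-1.0], [0.5]])
Y = X @ true_w + 0.01 * rng.normal(size=(N, 1))

# Training: w = (X^T X)^{-1} X^T Y
# (np.linalg.solve is preferred over forming the inverse explicitly)
w = np.linalg.solve(X.T @ X, X.T @ Y)

# Predict: Y_hat = X w, with shapes (N x 1) = (N x d)(d x 1)
Y_hat = X @ w
print(w.ravel())  # close to [2.0, -1.0, 0.5]
```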
Evaluation

$\mathbf{rmse} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (x_n^T \omega - t_n)^2}$

$\mathbf{mae} = \frac{1}{N} \sum_{n=1}^{N} |x_n^T \omega - t_n|$
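Both metrics are one-liners in numpy; the toy target and prediction arrays below are illustrative:

```python
import numpy as np

# t: true targets, y_hat: model outputs (toy values for illustration).
t = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])

# RMSE: square root of the mean squared error.
rmse = np.sqrt(np.mean((y_hat - t) ** 2))
# MAE: mean absolute error.
mae = np.mean(np.abs(y_hat - t))
print(rmse, mae)
```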
No regularization will result in …? And too much regularization will result in …?
Training

$C(\boldsymbol{\omega}) = \frac{1}{2N} \sum_{n=1}^{N} (x_n^T \omega - t_n)^2 + \lambda \sum_{d=1}^{D} \omega_d^2$

$C(\boldsymbol{\omega}) = \| \boldsymbol{Y} - \boldsymbol{X\omega} \|_2^2 + \lambda \| \boldsymbol{\omega} \|_2^2$

$C(\boldsymbol{\omega}) = (\boldsymbol{Y} - \boldsymbol{X\omega})^T (\boldsymbol{Y} - \boldsymbol{X\omega}) + \lambda \boldsymbol{\omega}^T \boldsymbol{\omega}$
Training

$\boldsymbol{\omega} = (\boldsymbol{X}^T \boldsymbol{X} + \lambda \boldsymbol{I})^{-1} \boldsymbol{X}^T \boldsymbol{Y}$

L2 Regularization

$C(\boldsymbol{\omega}) = \text{Original Cost Fn.} + \lambda \| \boldsymbol{\omega} \|_2^2$

L1 Regularization

$C(\boldsymbol{\omega}) = \text{Original Cost Fn.} + \lambda \| \boldsymbol{\omega} \|_1$

Pushes the weights of unimportant features to zero; favors sparse solutions.
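The ridge (L2) closed form is a one-line change from ordinary least squares; this sketch on random toy data also shows the shrinkage effect of the penalty:

```python
import numpy as np

# Toy data; lam is the regularization strength lambda.
rng = np.random.default_rng(2)
N, d = 50, 4
X = rng.normal(size=(N, d))
Y = rng.normal(size=(N, 1))

lam = 10.0
# Ridge: w = (X^T X + lambda * I)^{-1} X^T Y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
# Unregularized solution for comparison.
w_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# The L2 penalty shrinks the weights toward zero.
print(np.linalg.norm(w_ridge), "<", np.linalg.norm(w_ols))
```

The L1 penalty has no comparable closed form; it is typically minimized with iterative methods (e.g. coordinate descent), which is where the sparse solutions mentioned above come from.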
$\hat{\theta}$: estimated parameter, $\theta$: true parameter.

$\mathbf{Bias}(\hat{\theta}, \theta) = E[\hat{\theta}] - \theta$

$\mathbf{Var}(\hat{\theta}) = E\left[(\hat{\theta} - E[\hat{\theta}])^2\right]$

Higher sensitivity to changes in the data, a.k.a. high variance.
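The two definitions can be approximated by simulation: draw many datasets, compute the estimator on each, and average. As an illustrative choice (not from the slides), the estimator below is the biased 1/N sample variance of a Gaussian:

```python
import numpy as np

# theta_hat = (1/N) sample variance; true variance theta = 4 (sigma = 2).
# Expectations E[.] are approximated by averaging over many datasets.
rng = np.random.default_rng(3)
theta = 4.0
N = 10           # samples per dataset
trials = 200_000

data = rng.normal(0.0, 2.0, size=(trials, N))
estimates = data.var(axis=1)               # ddof=0: divides by N (biased)

bias = estimates.mean() - theta            # theory: -theta/N = -0.4
variance = np.mean((estimates - estimates.mean()) ** 2)
print(bias, variance)
```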
Training

$C(\boldsymbol{\omega}) = -\frac{1}{N} \sum_{n=1}^{N} \left[ t_n \log \sigma(x_n^T \omega) + (1 - t_n) \log\left(1 - \sigma(x_n^T \omega)\right) \right]$

$\sigma(x_n^T \omega) = \frac{1}{1 + e^{-x_n^T \omega}}$

Why use the sigmoid function in logistic regression? Why not use it in linear regression?

What does the output from sigma represent?
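A direct numpy transcription of the sigmoid and the cross-entropy cost above, on made-up toy labels (the sigmoid output is interpreted as a probability of the positive class):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(X, t, w):
    # C(w) = -(1/N) sum_n [t_n log s_n + (1 - t_n) log(1 - s_n)]
    p = sigmoid(X @ w)   # s(x_n^T w), read as P(t_n = 1 | x_n)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 2))
t = np.array([0, 1, 0, 1, 1, 0])
w = np.zeros(2)
print(cross_entropy(X, t, w))  # log 2 ≈ 0.693 when sigma(.) = 0.5 everywhere
```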
Predict

$\sigma(x_n^T \omega) = \frac{1}{1 + e^{-x_n^T \omega}}$

$\sigma(x_n^T \omega) > 0.5 \Rightarrow \hat{t}_n = 1$
$\sigma(x_n^T \omega) < 0.5 \Rightarrow \hat{t}_n = 0$
Predict

$x_n^T \omega$  vs.  $x_n^T \omega + b$

How can these be the same when predicting? What to modify in x?
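One common answer to the question above is the bias-absorption trick: append a constant 1 to x and fold b into the weight vector, so a single dot product covers both forms. A minimal sketch with illustrative numbers:

```python
import numpy as np

x = np.array([2.0, 3.0])
w = np.array([0.5, -1.0])
b = 0.25

# Append 1 to x and b to w: x'^T w' = x^T w + b.
x_aug = np.append(x, 1.0)   # [2.0, 3.0, 1.0]
w_aug = np.append(w, b)     # [0.5, -1.0, 0.25]

print(x @ w + b, x_aug @ w_aug)  # both -1.75
```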
Evaluation

                          t_n
                          Positive   Negative
$\hat{t}_n$   Positive    TP         FP
              Negative    FN         TN

$\mathbf{acc} = \frac{TP + TN}{TP + FP + FN + TN}$

$x_n^T \boldsymbol{\omega} < 0 \;\Leftrightarrow\; \sigma(x_n^T \boldsymbol{\omega}) < 0.5$
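The four cells of the table and the accuracy formula, computed with boolean masks on made-up binary labels:

```python
import numpy as np

t = np.array([1, 1, 0, 0, 1, 0])      # true labels t_n
t_hat = np.array([1, 0, 0, 1, 1, 0])  # predictions t_hat_n

TP = np.sum((t_hat == 1) & (t == 1))  # predicted positive, actually positive
FP = np.sum((t_hat == 1) & (t == 0))  # predicted positive, actually negative
FN = np.sum((t_hat == 0) & (t == 1))  # predicted negative, actually positive
TN = np.sum((t_hat == 0) & (t == 0))  # predicted negative, actually negative

acc = (TP + TN) / (TP + FP + FN + TN)
print(TP, FP, FN, TN, acc)
```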
What is the main difference between SGD and Batch Gradient Descent?
$C(\boldsymbol{\omega}) = -\frac{1}{N} \sum_{n=1}^{N} \left[ t_n \log \sigma(x_n^T \omega) + (1 - t_n) \log\left(1 - \sigma(x_n^T \omega)\right) \right]$

$\frac{dC}{d\omega} = \frac{1}{N} \sum_{n=1}^{N} x_n \left( \sigma(x_n^T \omega) - t_n \right)$
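The gradient above plugs directly into batch gradient descent, which averages over all N points per update; SGD would instead use one (or a few) randomly chosen points per update. A sketch on separable synthetic data (labels and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
N = 200
X = rng.normal(size=(N, 2))
t = (X[:, 0] - X[:, 1] > 0).astype(float)  # separable toy labels

w = np.zeros(2)
lr = 0.5
for _ in range(500):
    # Batch gradient: (1/N) sum_n x_n (sigma(x_n^T w) - t_n)
    grad = X.T @ (sigmoid(X @ w) - t) / N
    w -= lr * grad

acc = np.mean((sigmoid(X @ w) > 0.5) == (t == 1))
print(w, acc)  # accuracy should be high on this separable data
```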
$p(x \mid y) = \prod_{i=1}^{d} p(x_i \mid y)$

$p(S \mid x) = \frac{p(x \mid S)\, p(S)}{p(x \mid S)\, p(S) + p(x \mid E)\, p(E)}$

$p(x \mid S) = p(x_1 \mid S)\, p(x_2 \mid S)\, p(x_3 \mid S)\, p(x_4 \mid S)\, p(x_5 \mid S)$

What is computed during training? Remember the Scottish vs. English classifier?
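In that classifier, training just means estimating the priors p(S), p(E) and the per-word likelihoods p(x_i | y); prediction applies Bayes' rule with the naive factorization. A sketch in which the word list and all probability values are made up for illustration:

```python
import numpy as np

# Features: presence of 5 words, e.g. ["scone", "ceilidh", "tea", "queen", "loch"].
# These priors and likelihoods are hypothetical, as if estimated from counts.
p_S, p_E = 0.5, 0.5
p_x_given_S = np.array([0.7, 0.6, 0.8, 0.3, 0.7])  # p(x_i = 1 | S)
p_x_given_E = np.array([0.3, 0.1, 0.8, 0.6, 0.1])  # p(x_i = 1 | E)

x = np.array([1, 0, 1, 1, 1])  # a document's binary word-presence vector

def likelihood(x, p1):
    # Naive Bayes factorization: prod_i p(x_i | y)
    return np.prod(np.where(x == 1, p1, 1 - p1))

# Bayes' rule: p(S | x) = p(x|S)p(S) / (p(x|S)p(S) + p(x|E)p(E))
num = likelihood(x, p_x_given_S) * p_S
den = num + likelihood(x, p_x_given_E) * p_E
print("p(S | x) =", num / den)
```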
[Figure: separating hyperplane $\boldsymbol{\omega}^T x + b = 0$ with margin boundary $\boldsymbol{\omega}^T x + b = -1$, in the $(x^1, x^2)$ plane.]

What are the support vectors?
Support Vector Machines
Training

$J = \frac{1}{2} \| \boldsymbol{\omega} \|^2 - \sum_i \alpha_i \left( (\boldsymbol{\omega}^T \boldsymbol{x}^{(i)} + b)\, \boldsymbol{y}^{(i)} - 1 \right)$

I will never ask you to derive any of the SVM equations, so don't worry about that part. But you need to understand what alpha or a kernel is, and so on.
Predict

$\boldsymbol{\omega} = \sum_i \alpha_i \boldsymbol{y}^{(i)} \boldsymbol{x}^{(i)}$

$\boldsymbol{\omega}^T \boldsymbol{x} + b \geq 0$ for positive instances.
$\boldsymbol{\omega}^T \boldsymbol{x} + b < 0$ for negative instances.
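A sketch of recovering w from dual variables and classifying by the sign of the score; the alphas, b, and points below are illustrative values, not the output of a solved SVM:

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])  # training points
y = np.array([1.0, 1.0, -1.0])                         # labels in {+1, -1}
alpha = np.array([0.5, 0.0, 0.5])  # alpha_i > 0 only for support vectors
b = 0.0

# w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

x_new = np.array([3.0, 0.5])
score = w @ x_new + b
print(w, "positive" if score >= 0 else "negative")
```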
$J = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \boldsymbol{y}^{(i)} \boldsymbol{y}^{(j)} \left( \boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x}^{(j)}) \right)$

$K(\boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)}) = \boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x}^{(j)})$

Polynomial kernel: $K(\boldsymbol{x}_i, \boldsymbol{x}_j) \to (\boldsymbol{x}_i \cdot \boldsymbol{x}_j + 1)^d$

RBF kernel: $K(\boldsymbol{x}_i, \boldsymbol{x}_j) \to \exp\left( -\frac{\| \boldsymbol{x}_i - \boldsymbol{x}_j \|^2}{2\sigma^2} \right)$
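Both kernels from the slide as plain functions; the degree and sigma values are the kernel hyper-parameters, chosen arbitrarily here:

```python
import numpy as np

def poly_kernel(xi, xj, degree=2):
    # K(xi, xj) = (xi . xj + 1)^d
    return (xi @ xj + 1.0) ** degree

def rbf_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    diff = xi - xj
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(poly_kernel(a, b), rbf_kernel(a, b))
```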
Additional Questions

• A project team reports a low training error and claims their method is good. Problematic: low training error alone says nothing about generalization; the model may simply be overfitting.

• A project team split their data into training, validation and test. Using their training data and validation set, they chose the best parameter setting. They built a model using these parameters and their training data, and then reported their error on test data. OK: the test set was never used for model or hyper-parameter selection.