Logistic Regression - Update - 2

This document provides an overview of logistic regression using vectorization. It shows how to represent logistic regression parameters and computations as vectors and matrices rather than individual variables. This allows processing multiple samples simultaneously through matrix operations, improving efficiency. Key steps include representing the parameters as a weight vector and bias, computing the output and loss over a batch of samples, and taking the gradient to update the parameters through backpropagation. Vectorization avoids explicit loops and enables parallelization.


AI VIETNAM

All-in-One Course

Insight into
Logistic Regression

Quang-Vinh Dinh
Ph.D. in Computer Science
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Demo: Implementation - One Sample (Feature, Label)

1) Pick a sample $(x, y)$ from the training data
2) Compute the output $\hat{y}$:
   $z = wx + b$,  $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$
3) Compute the loss:
   $L(\hat{y}, y) = -y\log\hat{y} - (1-y)\log(1-\hat{y})$
4) Compute the derivatives:
   $\frac{\partial L}{\partial w} = x(\hat{y} - y)$,  $\frac{\partial L}{\partial b} = \hat{y} - y$
5) Update the parameters:
   $w = w - \eta\frac{\partial L}{\partial w}$,  $b = b - \eta\frac{\partial L}{\partial b}$

If the number of features changes, which functions are affected? (A sketch of this update step follows below.)
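A minimal sketch, in plain Python, of one such update step for a single feature; the function name and the example hyperparameters are illustrative, not from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(x, y, w, b, eta=0.01):
    # 2) compute the output
    z = w * x + b
    y_hat = sigmoid(z)
    # 3) binary cross-entropy loss (for monitoring)
    loss = -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)
    # 4) derivatives
    dL_dw = x * (y_hat - y)
    dL_db = y_hat - y
    # 5) update
    return w - eta * dL_dw, b - eta * dL_db, loss
```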
Demo: Implementation - Two Features

1) Pick a sample $(x, y)$ from the training data
2) Compute the output $\hat{y}$:
   $z = w_1 x_1 + w_2 x_2 + b$,  $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$
3) Compute the loss:
   $L(\hat{y}, y) = -y\log\hat{y} - (1-y)\log(1-\hat{y})$
4) Compute the derivatives (how to solve the problem for an arbitrary number of features?):
   $\frac{\partial L}{\partial w_i} = x_i(\hat{y} - y)$,  $\frac{\partial L}{\partial b} = \hat{y} - y$
5) Update the parameters:
   $w_i = w_i - \eta\frac{\partial L}{\partial w_i}$,  $b = b - \eta\frac{\partial L}{\partial b}$
Vector/Matrix Operations

Transpose:
   $\vec{v} = \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}$,  $\vec{v}^T = \begin{bmatrix} v_1 & \dots & v_n \end{bmatrix}$
   $A = \begin{bmatrix} a_{11} & \dots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \dots & a_{mn} \end{bmatrix}$,  $A^T = \begin{bmatrix} a_{11} & \dots & a_{m1} \\ \vdots & & \vdots \\ a_{1n} & \dots & a_{mn} \end{bmatrix}$

Multiply with a number:
   $\alpha\vec{u} = \alpha\begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix} = \begin{bmatrix} \alpha u_1 \\ \vdots \\ \alpha u_n \end{bmatrix}$

Examples (data → result):
   $\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}^T = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$,   $2 \times \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}$
Vector/Matrix Operations

Dot product:
   $\vec{v} = \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}$,  $\vec{u} = \begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}$,  $\vec{v} \cdot \vec{u} = v_1 u_1 + \dots + v_n u_n$

Example (v, w → result):
   $\begin{bmatrix} 1 & 2 \end{bmatrix} \cdot \begin{bmatrix} 2 \\ 3 \end{bmatrix} = 1\times 2 + 2\times 3 = 8$

(A NumPy version of these operations follows below.)
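A short NumPy illustration of these operations (transpose, multiplication by a number, dot product); the arrays are example values.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
print(A.T)          # transpose -> [[1 3], [2 4]]

u = np.array([1, 2, 3])
print(2 * u)        # multiply with a number -> [2 4 6]

v = np.array([1, 2])
w = np.array([2, 3])
print(v @ w)        # dot product -> 1*2 + 2*3 = 8
```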
Vectorization (Feature, Label)

1) Pick a sample $(x, y)$ from the training data
2) Compute the output (traditional): $z = wx + b$,  $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$
3) Compute the loss: $L(\hat{y}, y) = -y\log\hat{y} - (1-y)\log(1-\hat{y})$
4) Compute the derivatives: $\frac{\partial L}{\partial w} = x(\hat{y} - y)$,  $\frac{\partial L}{\partial b} = \hat{y} - y$
5) Update the parameters: $w = w - \eta\frac{\partial L}{\partial w}$,  $b = b - \eta\frac{\partial L}{\partial b}$  ($\eta$ is the learning rate)

Vectorized view: stack the input and the parameters as vectors,
   $\boldsymbol{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}$,  $\boldsymbol{\theta} = \begin{bmatrix} b \\ w \end{bmatrix}$,  $\boldsymbol{\theta}^T = \begin{bmatrix} b & w \end{bmatrix}$
so that the linear term becomes a dot product:
   $z = wx + b\cdot 1 = \begin{bmatrix} b & w \end{bmatrix}\begin{bmatrix} 1 \\ x \end{bmatrix} = \boldsymbol{\theta}^T\boldsymbol{x}$
Vectorization

With $\boldsymbol{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}$ and $\boldsymbol{\theta} = \begin{bmatrix} b \\ w \end{bmatrix}$, the forward pass and the loss keep the same (scalar) form:
   $z = \boldsymbol{\theta}^T\boldsymbol{x}$,  $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$,  $L(\hat{y}, y) = -y\log\hat{y} - (1-y)\log(1-\hat{y})$
What will we do with the derivatives and the update?
The two partial derivatives share the common factor $(\hat{y} - y)$:
   $\frac{\partial L}{\partial b} = \hat{y} - y = (\hat{y} - y)\times 1$,   $\frac{\partial L}{\partial w} = x(\hat{y} - y) = (\hat{y} - y)\times x$
Stacking them gives the gradient with respect to $\boldsymbol{\theta}$:
   $\nabla_{\boldsymbol{\theta}} L = \begin{bmatrix} \partial L/\partial b \\ \partial L/\partial w \end{bmatrix} = \begin{bmatrix} (\hat{y}-y)\times 1 \\ (\hat{y}-y)\times x \end{bmatrix} = \begin{bmatrix} 1 \\ x \end{bmatrix}(\hat{y}-y) \;\Rightarrow\; \nabla_{\boldsymbol{\theta}} L = \boldsymbol{x}(\hat{y}-y)$
Vectorization

With $z = \boldsymbol{\theta}^T\boldsymbol{x}$ and $\nabla_{\boldsymbol{\theta}} L = \begin{bmatrix} \partial L/\partial b \\ \partial L/\partial w \end{bmatrix}$, the two scalar updates
   $b = b - \eta\frac{\partial L}{\partial b}$,   $w = w - \eta\frac{\partial L}{\partial w}$
collapse into a single vector update ($\eta$ is the learning rate):
   $\boldsymbol{\theta} = \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L$
Vectorization: Traditional vs Vectorized

Traditional:
1) Pick a sample $(x, y)$ from the training data
2) Compute the output: $z = wx + b$,  $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$
3) Compute the loss: $L(\hat{y}, y) = -y\log\hat{y} - (1-y)\log(1-\hat{y})$
4) Compute the derivatives: $\frac{\partial L}{\partial w} = x(\hat{y}-y)$,  $\frac{\partial L}{\partial b} = \hat{y}-y$
5) Update the parameters: $w = w - \eta\frac{\partial L}{\partial w}$,  $b = b - \eta\frac{\partial L}{\partial b}$

Vectorized:
1) Pick a sample $(\boldsymbol{x}, y)$ from the training data
2) Compute the output: $z = \boldsymbol{\theta}^T\boldsymbol{x} = \boldsymbol{x}^T\boldsymbol{\theta}$,  $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$
3) Compute the loss: $L(\hat{y}, y) = -y\log\hat{y} - (1-y)\log(1-\hat{y})$
4) Compute the derivative: $\nabla_{\boldsymbol{\theta}} L = \boldsymbol{x}(\hat{y}-y)$
5) Update the parameters: $\boldsymbol{\theta} = \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L$

($\eta$ is the learning rate)
Vectorization
❖ Implementation (using NumPy)

1) Pick a sample $(\boldsymbol{x}, y)$ from the training data
2) Compute the output: $z = \boldsymbol{\theta}^T\boldsymbol{x} = \boldsymbol{x}^T\boldsymbol{\theta}$,  $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$
3) Compute the loss: $L(\hat{y}, y) = -y\log\hat{y} - (1-y)\log(1-\hat{y})$
4) Compute the derivative: $\nabla_{\boldsymbol{\theta}} L = \boldsymbol{x}(\hat{y}-y)$
5) Update the parameters: $\boldsymbol{\theta} = \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L$  ($\eta$ is the learning rate)
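The slide shows a NumPy snippet for these steps; a minimal sketch of what it might look like, assuming $\boldsymbol{x}$ already carries a leading 1 for the bias and theta stacks [b, w1, w2, ...] (names and defaults are illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_sample(x, y, theta, eta=0.01):
    z = x @ theta                                              # z = theta^T x
    y_hat = sigmoid(z)                                         # 2) output
    loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)    # 3) BCE loss
    grad = x * (y_hat - y)                                     # 4) gradient w.r.t. theta
    theta = theta - eta * grad                                 # 5) update
    return theta, loss
```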
Worked example (one sample)

Given $\boldsymbol{x} = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1.4 \\ 0.2 \end{bmatrix}$, label $y = 0$, $\boldsymbol{\theta} = \begin{bmatrix} b \\ w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.5 \\ -0.1 \end{bmatrix}$, and $\eta = 0.01$:

1) Pick the sample $(\boldsymbol{x}, y)$ from the training data
2) Output: $\hat{y} = \sigma(\boldsymbol{\theta}^T\boldsymbol{x}) = 0.6856$
3) Loss: $L = -\log(1 - 0.6856) = 1.1573$
4) Derivative: $\nabla_{\boldsymbol{\theta}} L = \boldsymbol{x}(\hat{y} - y) = \begin{bmatrix} 1 \\ 1.4 \\ 0.2 \end{bmatrix}\times 0.6856 = \begin{bmatrix} 0.6856 \\ 0.9599 \\ 0.1371 \end{bmatrix} = \begin{bmatrix} L'_b \\ L'_{w_1} \\ L'_{w_2} \end{bmatrix}$
5) Update: $\boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L = \begin{bmatrix} 0.1 \\ 0.5 \\ -0.1 \end{bmatrix} - \eta\begin{bmatrix} 0.6856 \\ 0.9599 \\ 0.1371 \end{bmatrix} \approx \begin{bmatrix} 0.0931 \\ 0.4904 \\ -0.1014 \end{bmatrix}$
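These numbers can be reproduced with a few lines of NumPy; values are taken from the example above, and small differences come from rounding.

```python
import numpy as np

x = np.array([1.0, 1.4, 0.2])       # [1, x1, x2]
y = 0.0
theta = np.array([0.1, 0.5, -0.1])  # [b, w1, w2]
eta = 0.01

y_hat = 1.0 / (1.0 + np.exp(-(x @ theta)))                # ~0.6856
loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)   # ~1.1573
grad = x * (y_hat - y)                                    # ~[0.6856, 0.9599, 0.1371]
theta_new = theta - eta * grad                            # ~[0.0931, 0.4904, -0.1014]
print(y_hat, loss, grad, theta_new)
```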
Logistic Regression - Stochastic

Dataset sample: $\boldsymbol{x} = \begin{bmatrix} 1 \\ 1.4 \\ 0.2 \end{bmatrix}$,  $y = 0$

1) Pick a sample $(\boldsymbol{x}, y)$ from the training data
2) Compute the output: $z = \boldsymbol{\theta}^T\boldsymbol{x}$,  $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$
3) Compute the loss: $L(\boldsymbol{\theta}) = -y\log\hat{y} - (1-y)\log(1-\hat{y})$
4) Compute the derivative: $\nabla_{\boldsymbol{\theta}} L = \boldsymbol{x}(\hat{y} - y)$
5) Update the parameters: $\boldsymbol{\theta} = \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L$  ($\eta$ is the learning rate)

Demo
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Optimization for One+ Samples
❖ Equations for partial gradients

$f(x^{(i)}) = a x^{(i)} + b$,  $g(f^{(i)}) = (f^{(i)} - y^{(i)})^2$, with samples $(x^{(1)}=1,\; y^{(1)}=5)$ and $(x^{(2)}=2,\; y^{(2)}=7)$.

Building blocks:
   $\frac{df}{da} = x$,  $\frac{df}{db} = 1$,  $\frac{dg}{df} = 2(f - y)$

Chain rule (illustrated as a computational graph $x^{(i)} \to f^{(i)} \to g^{(i)}$ with shared parameters $a$, $b$):
   $\frac{dg}{da} = \frac{dg}{df}\frac{df}{da} = 2x(f - y)$,   $\frac{dg}{db} = \frac{dg}{df}\frac{df}{db} = 2(f - y)$

While searching for the optimal $a$ and $b$, at any given time $a$ and $b$ have concrete values.
❖ Optimization for a composite function

Find $a$ and $b$ so that $g(f(x))$ is minimum, where
   $f(x^{(i)}) = a x^{(i)} + b$,  $g(f^{(i)}) = (f^{(i)} - y^{(i)})^2$, with samples $(x^{(1)}=1,\; y^{(1)}=5)$ and $(x^{(2)}=2,\; y^{(2)}=7)$.

Partial derivative functions (per sample):
   $\frac{dg}{da} = \frac{dg}{df}\frac{df}{da} = 2x(f - y)$,   $\frac{dg}{db} = \frac{dg}{df}\frac{df}{db} = 2(f - y)$

Summing over the samples (one computational graph per sample, sharing $a$ and $b$; see the sketch below):
   $\sum_i \frac{dg^{(i)}}{da} = \frac{dg^{(1)}}{df^{(1)}}\frac{df^{(1)}}{da} + \frac{dg^{(2)}}{df^{(2)}}\frac{df^{(2)}}{da}$,   $\sum_i \frac{dg^{(i)}}{db} = \frac{dg^{(1)}}{df^{(1)}}\frac{df^{(1)}}{db} + \frac{dg^{(2)}}{df^{(2)}}\frac{df^{(2)}}{db}$
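A small sketch of gradient descent using these summed gradients on the two samples (1, 5) and (2, 7); the initial values, learning rate, and iteration count are assumptions chosen for illustration.

```python
# Minimize g = sum_i (a*x_i + b - y_i)^2 over the two samples.
data = [(1.0, 5.0), (2.0, 7.0)]
a, b = 0.0, 0.0
eta = 0.01

for _ in range(5000):
    dg_da = sum(2 * x * (a * x + b - y) for x, y in data)  # sum_i dg_i/da
    dg_db = sum(2 * (a * x + b - y) for x, y in data)      # sum_i dg_i/db
    a, b = a - eta * dg_da, b - eta * dg_db

print(a, b)  # approaches a = 2, b = 3 (the exact fit y = 2x + 3)
```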
Optimization
❖ How to use gradient information

Option 1: use the gradient from one sample (info 1) for an update at time $t$, then the gradient from the next sample (info 2) for an update at time $t+1$.
Option 2: combine the gradients from both samples (info 1 + info 2) into a single update.

Summary 1 (one update per sample, $\eta = 0.01$):
   $\frac{dg}{da} = \frac{dg}{df}\frac{df}{da} = 2x(f - y)$,   $\frac{dg}{db} = \frac{dg}{df}\frac{df}{db} = 2(f - y)$
- Initialize $a$, $b$
- Compute the partial gradients at $a$, $b$
- Move $a$, $b$ opposite to $da$, $db$
Summary 2 (one combined update from both samples, $\eta = 0.01$):
   $\frac{dg}{da} = \frac{dg}{df}\frac{df}{da} = 2x(f - y)$,   $\frac{dg}{db} = \frac{dg}{df}\frac{df}{db} = 2(f - y)$
- Initialize $a$, $b$
- Compute the partial gradients at $a$, $b$ (summed over the samples)
- Move $a$, $b$ opposite to $da$, $db$
Summary 2, repeated with a smaller learning rate ($\eta = 0.001$):
   $\frac{dg}{da} = \frac{dg}{df}\frac{df}{da} = 2x(f - y)$,   $\frac{dg}{db} = \frac{dg}{df}\frac{df}{db} = 2(f - y)$
- Initialize $a$, $b$
- Compute the partial gradients at $a$, $b$ (summed over the samples)
- Move $a$, $b$ opposite to $da$, $db$
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Logistic Regression (m samples)
❖ Construct formulas

Dataset ($m = 2$):
   $\boldsymbol{x} = \begin{bmatrix} 1 & 1.5 & 0.2 \\ 1 & 4.1 & 1.3 \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \end{bmatrix}$,   $\boldsymbol{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$,   $\boldsymbol{\theta} = \begin{bmatrix} b \\ w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.5 \\ -0.1 \end{bmatrix}$

2) Compute the output: $\boldsymbol{z} = \boldsymbol{x}\boldsymbol{\theta}$,  $\hat{\boldsymbol{y}} = \sigma(\boldsymbol{z}) = \frac{1}{1 + e^{-\boldsymbol{z}}}$ (elementwise)
   $\boldsymbol{z} = \begin{bmatrix} z^{(1)} \\ z^{(2)} \end{bmatrix} = \begin{bmatrix} w_1 x_1^{(1)} + w_2 x_2^{(1)} + b \\ w_1 x_1^{(2)} + w_2 x_2^{(2)} + b \end{bmatrix} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \end{bmatrix}\begin{bmatrix} b \\ w_1 \\ w_2 \end{bmatrix} = \boldsymbol{x}\boldsymbol{\theta} = \begin{bmatrix} 0.83 \\ 2.02 \end{bmatrix}$
Logistic Regression (m samples)
❖ Construct formulas (NumPy perspective)

   $\boldsymbol{z} = \boldsymbol{x}\boldsymbol{\theta} = \begin{bmatrix} 0.83 \\ 2.02 \end{bmatrix}$,   $\hat{\boldsymbol{y}} = \sigma(\boldsymbol{z}) = \begin{bmatrix} \hat{y}^{(1)} \\ \hat{y}^{(2)} \end{bmatrix} = \begin{bmatrix} \frac{1}{1+e^{-z^{(1)}}} \\ \frac{1}{1+e^{-z^{(2)}}} \end{bmatrix} = \frac{1}{1 + e^{-\boldsymbol{z}}} = \begin{bmatrix} 0.69 \\ 0.88 \end{bmatrix}$
Logistic Regression (m samples)
❖ Construct formulas

3) Compute the loss (mean binary cross-entropy over the batch):
   $L(\hat{\boldsymbol{y}}, \boldsymbol{y}) = \frac{1}{m}\left[-\boldsymbol{y}^T\log\hat{\boldsymbol{y}} - (1-\boldsymbol{y})^T\log(1-\hat{\boldsymbol{y}})\right] = \frac{L^{(1)}(\hat{y}^{(1)}, y^{(1)}) + L^{(2)}(\hat{y}^{(2)}, y^{(2)})}{m}$
where, for each sample,
   $L^{(i)}(\hat{y}^{(i)}, y^{(i)}) = -y^{(i)}\log\hat{y}^{(i)} - (1-y^{(i)})\log(1-\hat{y}^{(i)})$
4) Compute the derivative (average over the samples)

Per sample ($x_0^{(i)} = 1$ is the bias input):
   sample 1: $\frac{\partial L^{(1)}}{\partial b} = \hat{y}^{(1)} - y^{(1)}$,  $\frac{\partial L^{(1)}}{\partial w_1} = x_1^{(1)}(\hat{y}^{(1)} - y^{(1)})$,  $\frac{\partial L^{(1)}}{\partial w_2} = x_2^{(1)}(\hat{y}^{(1)} - y^{(1)})$
   sample 2: $\frac{\partial L^{(2)}}{\partial b} = \hat{y}^{(2)} - y^{(2)}$,  $\frac{\partial L^{(2)}}{\partial w_1} = x_1^{(2)}(\hat{y}^{(2)} - y^{(2)})$,  $\frac{\partial L^{(2)}}{\partial w_2} = x_2^{(2)}(\hat{y}^{(2)} - y^{(2)})$

Averaging:
   $\frac{\partial L}{\partial b} = \frac{1}{m}\left[1\cdot(\hat{y}^{(1)} - y^{(1)}) + 1\cdot(\hat{y}^{(2)} - y^{(2)})\right] = \frac{1}{m}\begin{bmatrix} x_0^{(1)} & x_0^{(2)} \end{bmatrix}\begin{bmatrix} \hat{y}^{(1)} - y^{(1)} \\ \hat{y}^{(2)} - y^{(2)} \end{bmatrix}$
   $\frac{\partial L}{\partial w_1} = \frac{1}{m}\left[x_1^{(1)}(\hat{y}^{(1)} - y^{(1)}) + x_1^{(2)}(\hat{y}^{(2)} - y^{(2)})\right] = \frac{1}{m}\begin{bmatrix} x_1^{(1)} & x_1^{(2)} \end{bmatrix}\begin{bmatrix} \hat{y}^{(1)} - y^{(1)} \\ \hat{y}^{(2)} - y^{(2)} \end{bmatrix}$
   $\frac{\partial L}{\partial w_2} = \frac{1}{m}\left[x_2^{(1)}(\hat{y}^{(1)} - y^{(1)}) + x_2^{(2)}(\hat{y}^{(2)} - y^{(2)})\right] = \frac{1}{m}\begin{bmatrix} x_2^{(1)} & x_2^{(2)} \end{bmatrix}\begin{bmatrix} \hat{y}^{(1)} - y^{(1)} \\ \hat{y}^{(2)} - y^{(2)} \end{bmatrix}$
Stacking the three rows gives the gradient in matrix form:
   $\nabla_{\boldsymbol{\theta}} L = \begin{bmatrix} \partial L/\partial b \\ \partial L/\partial w_1 \\ \partial L/\partial w_2 \end{bmatrix} = \frac{1}{m}\begin{bmatrix} x_0^{(1)} & x_0^{(2)} \\ x_1^{(1)} & x_1^{(2)} \\ x_2^{(1)} & x_2^{(2)} \end{bmatrix}\begin{bmatrix} \hat{y}^{(1)} - y^{(1)} \\ \hat{y}^{(2)} - y^{(2)} \end{bmatrix}$
The first factor is $\boldsymbol{x}^T$ (its columns are the samples $\boldsymbol{x}^{(1)}$, $\boldsymbol{x}^{(2)}$) and the second is $\hat{\boldsymbol{y}} - \boldsymbol{y}$, so
   $\nabla_{\boldsymbol{\theta}} L = \frac{1}{m}\boldsymbol{x}^T(\hat{\boldsymbol{y}} - \boldsymbol{y})$
5) Update the parameters
   $\nabla_{\boldsymbol{\theta}} L = \begin{bmatrix} \partial L/\partial b \\ \partial L/\partial w_1 \\ \partial L/\partial w_2 \end{bmatrix} = \frac{1}{m}\boldsymbol{x}^T(\hat{\boldsymbol{y}} - \boldsymbol{y})$
The three scalar updates
   $b = b - \eta\frac{\partial L}{\partial b}$,  $w_1 = w_1 - \eta\frac{\partial L}{\partial w_1}$,  $w_2 = w_2 - \eta\frac{\partial L}{\partial w_2}$
become the single vector update
   $\boldsymbol{\theta} = \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L$
Logistic Regression - Mini-batch

Mini-batch of size $m = 2$:
   $\boldsymbol{x} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \end{bmatrix}$,   $\boldsymbol{\theta} = \begin{bmatrix} b \\ w_1 \\ w_2 \end{bmatrix}$

1) Pick $m$ samples from the training data
2) Compute the outputs: $\boldsymbol{z} = \boldsymbol{x}\boldsymbol{\theta}$,  $\hat{\boldsymbol{y}} = \sigma(\boldsymbol{z}) = \frac{1}{1+e^{-\boldsymbol{z}}}$
3) Compute the loss: $L(\hat{\boldsymbol{y}}, \boldsymbol{y}) = \frac{1}{m}\left[-\boldsymbol{y}^T\log\hat{\boldsymbol{y}} - (1-\boldsymbol{y})^T\log(1-\hat{\boldsymbol{y}})\right]$
4) Compute the derivative: $\nabla_{\boldsymbol{\theta}} L = \frac{1}{m}\boldsymbol{x}^T(\hat{\boldsymbol{y}} - \boldsymbol{y})$
5) Update the parameters: $\boldsymbol{\theta} = \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L$  ($\eta$ is the learning rate)

Model pipeline: Input $\boldsymbol{x}$ → Model $\boldsymbol{\theta}$ → $\hat{\boldsymbol{y}} = \sigma(\boldsymbol{x}\boldsymbol{\theta})$ → Loss $L(\boldsymbol{\theta})$ against label $\boldsymbol{y}$.
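A minimal NumPy sketch of one mini-batch step, assuming the design matrix x already contains the leading column of ones; the function name and default learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_step(x, y, theta, eta=0.01):
    m = x.shape[0]
    y_hat = sigmoid(x @ theta)                                      # 2) outputs
    loss = -(y @ np.log(y_hat) + (1 - y) @ np.log(1 - y_hat)) / m   # 3) mean BCE
    grad = x.T @ (y_hat - y) / m                                    # 4) gradient
    theta = theta - eta * grad                                      # 5) update
    return theta, loss
```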
Worked example (mini-batch, $m = 2$)

Dataset: $\boldsymbol{x} = \begin{bmatrix} 1 & 1.5 & 0.2 \\ 1 & 4.1 & 1.3 \end{bmatrix}$,  $\boldsymbol{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$;  model: $\boldsymbol{\theta} = \begin{bmatrix} 0.1 \\ 0.5 \\ -0.1 \end{bmatrix}$

2) Outputs:
   $\hat{\boldsymbol{y}} = \sigma(\boldsymbol{x}\boldsymbol{\theta}) = \sigma\left(\begin{bmatrix} 1 & 1.5 & 0.2 \\ 1 & 4.1 & 1.3 \end{bmatrix}\begin{bmatrix} 0.1 \\ 0.5 \\ -0.1 \end{bmatrix}\right) = \sigma\left(\begin{bmatrix} 0.83 \\ 2.02 \end{bmatrix}\right) = \begin{bmatrix} 0.6963 \\ 0.8828 \end{bmatrix}$
3) Loss:
   $L(\boldsymbol{\theta}) = \frac{1}{m}\left[-\begin{bmatrix} 0 & 1 \end{bmatrix}\begin{bmatrix} \log 0.6963 \\ \log 0.8828 \end{bmatrix} - \begin{bmatrix} 1 & 0 \end{bmatrix}\begin{bmatrix} \log(1-0.6963) \\ \log(1-0.8828) \end{bmatrix}\right]$
   $= \frac{1}{2}\left[-\log 0.8828 - \log(1 - 0.6963)\right] = \frac{0.1246 + 1.1917}{2} = 0.65815$
4) Derivative:
   $\nabla_{\boldsymbol{\theta}} L = \frac{1}{m}\boldsymbol{x}^T(\hat{\boldsymbol{y}} - \boldsymbol{y}) = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1.5 & 4.1 \\ 0.2 & 1.3 \end{bmatrix}\left(\begin{bmatrix} 0.6963 \\ 0.8828 \end{bmatrix} - \begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1.5 & 4.1 \\ 0.2 & 1.3 \end{bmatrix}\begin{bmatrix} 0.6963 \\ -0.1172 \end{bmatrix} = \begin{bmatrix} 0.28961 \\ 0.28217 \\ -0.0064 \end{bmatrix}$
5) Update (with $\eta = 0.01$):
   $\boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L = \begin{bmatrix} 0.1 \\ 0.5 \\ -0.1 \end{bmatrix} - \eta\begin{bmatrix} 0.28961 \\ 0.28217 \\ -0.0064 \end{bmatrix} \approx \begin{bmatrix} 0.0971 \\ 0.4971 \\ -0.0999 \end{bmatrix}$
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Logistic Regression - Batch

Design matrix (all $N$ samples) and parameters:
   $\boldsymbol{x} = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ 1 & x_1^{(3)} & x_2^{(3)} \\ 1 & x_1^{(4)} & x_2^{(4)} \end{bmatrix}$,   $\boldsymbol{\theta} = \begin{bmatrix} b \\ w_1 \\ w_2 \end{bmatrix}$

1) Pick all the samples from the training data
2) Compute the outputs: $\boldsymbol{z} = \boldsymbol{x}\boldsymbol{\theta}$,  $\hat{\boldsymbol{y}} = \sigma(\boldsymbol{z}) = \frac{1}{1+e^{-\boldsymbol{z}}}$
3) Compute the loss: $L(\hat{\boldsymbol{y}}, \boldsymbol{y}) = \frac{1}{N}\left[-\boldsymbol{y}^T\log\hat{\boldsymbol{y}} - (1-\boldsymbol{y})^T\log(1-\hat{\boldsymbol{y}})\right]$
4) Compute the derivative: $\nabla_{\boldsymbol{\theta}} L = \frac{1}{N}\boldsymbol{x}^T(\hat{\boldsymbol{y}} - \boldsymbol{y})$
5) Update the parameters: $\boldsymbol{\theta} = \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L$  ($\eta$ is the learning rate)

Model pipeline: Input $\boldsymbol{x}$ → Model $\boldsymbol{\theta}$ → $\hat{\boldsymbol{y}} = \sigma(\boldsymbol{x}\boldsymbol{\theta})$ → Loss $L(\boldsymbol{\theta})$ against label $\boldsymbol{y}$.
Logistic Regression - Batch: worked example

Dataset ($N = 4$):
   $\boldsymbol{x} = \begin{bmatrix} 1 & 1.4 & 0.2 \\ 1 & 1.5 & 0.2 \\ 1 & 3.0 & 1.1 \\ 1 & 4.1 & 1.3 \end{bmatrix}$,   $\boldsymbol{y} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$,   $\boldsymbol{\theta} = \begin{bmatrix} b \\ w_1 \\ w_2 \end{bmatrix}$

The same five steps apply: $\boldsymbol{z} = \boldsymbol{x}\boldsymbol{\theta}$,  $\hat{\boldsymbol{y}} = \sigma(\boldsymbol{z})$,  $L(\hat{\boldsymbol{y}}, \boldsymbol{y})$,  $\nabla_{\boldsymbol{\theta}} L = \frac{1}{N}\boldsymbol{x}^T(\hat{\boldsymbol{y}} - \boldsymbol{y})$,  $\boldsymbol{\theta} = \boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L$. A full training-loop sketch follows below.
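A sketch of full-batch training on this 4-sample dataset; the initial theta, learning rate, and number of epochs are assumptions chosen for illustration.

```python
import numpy as np

x = np.array([[1, 1.4, 0.2],
              [1, 1.5, 0.2],
              [1, 3.0, 1.1],
              [1, 4.1, 1.3]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.1, 0.5, -0.1])
eta, epochs = 0.1, 1000

for _ in range(epochs):
    y_hat = 1.0 / (1.0 + np.exp(-(x @ theta)))  # batch outputs
    grad = x.T @ (y_hat - y) / len(y)           # averaged gradient
    theta = theta - eta * grad                  # update

print(theta)
```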
2) Outputs (with $\boldsymbol{\theta} = \begin{bmatrix} 0.1 \\ 0.5 \\ -0.1 \end{bmatrix}$):
   $\hat{\boldsymbol{y}} = \sigma(\boldsymbol{x}\boldsymbol{\theta}) = \sigma\left(\begin{bmatrix} 1 & 1.4 & 0.2 \\ 1 & 1.5 & 0.2 \\ 1 & 3.0 & 1.1 \\ 1 & 4.1 & 1.3 \end{bmatrix}\begin{bmatrix} 0.1 \\ 0.5 \\ -0.1 \end{bmatrix}\right) = \sigma\left(\begin{bmatrix} 0.78 \\ 0.83 \\ 1.49 \\ 2.02 \end{bmatrix}\right) = \begin{bmatrix} 0.6856 \\ 0.6963 \\ 0.8160 \\ 0.8828 \end{bmatrix}$
3) Loss:
   $L(\boldsymbol{\theta}) = \frac{1}{N}\left[-\begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}^T\log\begin{bmatrix} 0.6856 \\ 0.6963 \\ 0.8160 \\ 0.8828 \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}^T\log\left(1 - \begin{bmatrix} 0.6856 \\ 0.6963 \\ 0.8160 \\ 0.8828 \end{bmatrix}\right)\right]$
   $= \frac{1}{N}\left[-\begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}^T\log\begin{bmatrix} 0.6856 \\ 0.6963 \\ 0.8160 \\ 0.8828 \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}^T\log\begin{bmatrix} 0.3144 \\ 0.3037 \\ 0.1840 \\ 0.1172 \end{bmatrix}\right]$
   $= \frac{1}{4}\left[-\log 0.8160 - \log 0.8828 - \log 0.3144 - \log 0.3037\right] = 0.6691$
4) Derivative:
   $\nabla_{\boldsymbol{\theta}} L = \frac{1}{N}\boldsymbol{x}^T(\hat{\boldsymbol{y}} - \boldsymbol{y}) = \frac{1}{4}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1.4 & 1.5 & 3.0 & 4.1 \\ 0.2 & 0.2 & 1.1 & 1.3 \end{bmatrix}\left(\begin{bmatrix} 0.6856 \\ 0.6963 \\ 0.8160 \\ 0.8828 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}\right)$
   $= \frac{1}{4}\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1.4 & 1.5 & 3.0 & 4.1 \\ 0.2 & 0.2 & 1.1 & 1.3 \end{bmatrix}\begin{bmatrix} 0.6856 \\ 0.6963 \\ -0.184 \\ -0.1172 \end{bmatrix} = \begin{bmatrix} 0.2702 \\ 0.2431 \\ -0.019 \end{bmatrix}$
5) Update:
   $\boldsymbol{\theta} - \eta\nabla_{\boldsymbol{\theta}} L = \begin{bmatrix} 0.1 \\ 0.5 \\ -0.1 \end{bmatrix} - \eta\begin{bmatrix} 0.2702 \\ 0.2431 \\ -0.019 \end{bmatrix}$
With $\eta = 0.01$ this gives approximately $\begin{bmatrix} 0.0973 \\ 0.4976 \\ -0.0998 \end{bmatrix}$.
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Hessian Matrices
❖ Definition

The Hessian matrix (or Hessian) is a square matrix of the second-order partial derivatives of a scalar-valued function.
https://en.wikipedia.org/wiki/Hessian_matrix

Given $f(x, y)$ with $f: R^2 \to R$:
   $H_f = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x\partial y} \\ \frac{\partial^2 f}{\partial x\partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix}$

Example: given $f(x, y) = x^2 + 2x^2 y + y^3$,
   $\frac{\partial f}{\partial x} = 2x + 4xy$,  $\frac{\partial f}{\partial y} = 2x^2 + 3y^2$,  $H_f = \begin{bmatrix} 2 + 4y & 4x \\ 4x & 6y \end{bmatrix}$
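A quick symbolic check of this example; the use of SymPy here is just one convenient way to verify it.

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 2*x**2*y + y**3
print(sp.hessian(f, (x, y)))   # Matrix([[4*y + 2, 4*x], [4*x, 6*y]])
```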
Binary Cross-Entropy
❖ Convex function

Model and loss:
   $z = \boldsymbol{\theta}^T\boldsymbol{x}$,  $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$,  $L = -y\log\hat{y} - (1-y)\log(1-\hat{y})$

Derivative (chain rule):
   $\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial \theta_i}$, where
   $\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}$,  $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$,  $\frac{\partial z}{\partial \theta_i} = x_i$
   $\Rightarrow\; \frac{\partial L}{\partial \theta_i} = x_i(\hat{y} - y)$

Second derivative:
   $\frac{\partial^2 L}{\partial \theta_i^2} = \frac{\partial}{\partial \theta_i}\left[x_i(\hat{y} - y)\right] = x_i^2\,\hat{y}(1-\hat{y}) \geq 0$
since $x_i^2 \geq 0$ and $\hat{y} - \hat{y}^2 = \hat{y}(1-\hat{y}) \in \left[0, \frac{1}{4}\right]$, so the BCE loss is convex in each $\theta_i$.
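A finite-difference sanity check of this non-negativity, with illustrative values for x, y, and theta:

```python
import numpy as np

def bce(theta, x, y):
    y_hat = 1.0 / (1.0 + np.exp(-(x @ theta)))
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

x = np.array([1.0, 1.4, 0.2]); y = 1.0
theta = np.array([0.1, 0.5, -0.1])
i, h = 1, 1e-4
e = np.zeros_like(theta); e[i] = h

# numerical d^2L/dtheta_i^2 vs the analytic value x_i^2 * y_hat * (1 - y_hat)
second = (bce(theta + e, x, y) - 2 * bce(theta, x, y) + bce(theta - e, x, y)) / h**2
y_hat = 1.0 / (1.0 + np.exp(-(x @ theta)))
print(second, x[i]**2 * y_hat * (1 - y_hat))   # both positive, approximately equal
```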
Logistic Regression - MSE
❖ Construct loss

Model and loss:
   $z = \boldsymbol{\theta}^T\boldsymbol{x} = \boldsymbol{x}^T\boldsymbol{\theta}$,  $\hat{y} = \sigma(z) = \frac{1}{1+e^{-z}}$,  $L = (\hat{y} - y)^2$

Derivative (chain rule):
   $\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial \theta_i}$, where
   $\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)$,  $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$,  $\frac{\partial z}{\partial \theta_i} = x_i$
   $\Rightarrow\; \frac{\partial L}{\partial \theta_i} = 2x_i(\hat{y} - y)\hat{y}(1-\hat{y})$
Mean Squared Error (second derivative)

From the first derivative,
   $\frac{\partial L}{\partial \theta_i} = 2x_i(\hat{y} - y)\hat{y}(1-\hat{y}) = 2x_i\left(-\hat{y}^3 + \hat{y}^2 - y\hat{y} + y\hat{y}^2\right)$
Using $\frac{\partial \hat{y}}{\partial \theta_i} = x_i\hat{y}(1-\hat{y})$, the second derivative is
   $\frac{\partial^2 L}{\partial \theta_i^2} = 2x_i\frac{\partial}{\partial \theta_i}\left(-\hat{y}^3 + \hat{y}^2 - y\hat{y} + y\hat{y}^2\right)$
   $= 2x_i\left[-3\hat{y}^2\,x_i\hat{y}(1-\hat{y}) + 2\hat{y}\,x_i\hat{y}(1-\hat{y}) - y\,x_i\hat{y}(1-\hat{y}) + 2y\hat{y}\,x_i\hat{y}(1-\hat{y})\right]$
   $= 2x_i^2\,\hat{y}(1-\hat{y})\left(-3\hat{y}^2 + 2\hat{y} - y + 2y\hat{y}\right)$
Mean Squared Error (sign of the second derivative)

   $\frac{\partial^2 L}{\partial \theta_i^2} = 2x_i^2\,\hat{y}(1-\hat{y})\left(-3\hat{y}^2 + 2\hat{y} - y + 2y\hat{y}\right)$
Here $x_i^2 \geq 0$ and $\hat{y}(1-\hat{y}) \in \left[0, \frac{1}{4}\right]$, so the sign depends on the last factor:
   for $y = 0$:  $f(\hat{y}) = -3\hat{y}^2 + 2\hat{y}$
   for $y = 1$:  $f(\hat{y}) = -3\hat{y}^2 + 4\hat{y} - 1$
Both of these change sign on $(0, 1)$, so the MSE loss with a sigmoid output is not convex in $\boldsymbol{\theta}$.
MSE and BCE
❖ Visualization: loss curves of Mean Squared Error vs Binary Cross-Entropy (figure)
Sigmoid and Tanh Functions

   $\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}$,   $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
(plots of the two functions)
Sigmoid and Tanh Functions

   $\mathrm{sigmoid}(2x) = \frac{1}{1+e^{-2x}}$,   $\tanh(x) = 2\times\frac{1}{1+e^{-2x}} - 1$
   $\Rightarrow\; \tanh(x) = 2\times\mathrm{sigmoid}(2x) - 1$
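A quick numeric check of this identity over an illustrative grid of points:

```python
import numpy as np

xs = np.linspace(-5, 5, 101)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(np.allclose(np.tanh(xs), 2 * sigmoid(2 * xs) - 1))   # True
```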
Outline
➢ Vectorization
➢ Optimization for 1+ samples
➢ Logistic Regression – Mini-batch
➢ Logistic Regression – Batch
➢ BCE and MSE Loss Functions
➢ Sigmoid and Tanh Function (Optional)
Tanh Function

   $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1} = 1 - \frac{2}{e^{2x} + 1}$
   $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}} = -\frac{e^{-2x} - 1}{e^{-2x} + 1} = \frac{2}{e^{-2x} + 1} - 1$
Tanh Function (derivative)

   $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1} = 1 - \frac{2}{e^{2x} + 1} = \frac{2}{e^{-2x} + 1} - 1$

Quotient rule:
   $\tanh'(x) = \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)' = \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2}$
   $= \frac{(e^x + e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)^2 = 1 - \tanh^2(x)$
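A finite-difference check of $\tanh'(x) = 1 - \tanh^2(x)$ at a few illustrative points:

```python
import numpy as np

xs = np.linspace(-3, 3, 13)
h = 1e-6
numeric = (np.tanh(xs + h) - np.tanh(xs - h)) / (2 * h)   # central difference
print(np.allclose(numeric, 1 - np.tanh(xs)**2))           # True
```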
Tanh Function (derivative, alternative derivation)

Starting from $\tanh(x) = \frac{2}{e^{-2x} + 1} - 1$:
   $\tanh'(x) = \left(\frac{2}{e^{-2x} + 1} - 1\right)' = \frac{4e^{-2x}}{(e^{-2x} + 1)^2} = 4\,\frac{(e^{-2x} + 1) - 1}{(e^{-2x} + 1)^2} = 4\left[\frac{1}{e^{-2x} + 1} - \frac{1}{(e^{-2x} + 1)^2}\right]$
   $= -\left[\frac{4}{(e^{-2x} + 1)^2} - \frac{4}{e^{-2x} + 1} + 1\right] + 1 = 1 - \left(\frac{2}{e^{-2x} + 1} - 1\right)^2 = 1 - \tanh^2(x)$
Logistic Regression with Tanh
❖ Construct loss

Model and loss:
   $z = \boldsymbol{\theta}^T\boldsymbol{x} = \boldsymbol{x}^T\boldsymbol{\theta}$,  $\hat{y} = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$,  $\hat{y}_s = \frac{\hat{y} + 1}{2}$ (maps the output from $(-1, 1)$ to $(0, 1)$),
   $L = -y\log\hat{y}_s - (1-y)\log(1-\hat{y}_s)$

Derivative (chain rule):
   $\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial \hat{y}_s}\frac{\partial \hat{y}_s}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial \theta_i}$, where
   $\frac{\partial L}{\partial \hat{y}_s} = -\frac{y}{\hat{y}_s} + \frac{1-y}{1-\hat{y}_s} = \frac{\hat{y}_s - y}{\hat{y}_s(1-\hat{y}_s)}$,  $\frac{\partial \hat{y}_s}{\partial \hat{y}} = \frac{1}{2}$,  $\frac{\partial \hat{y}}{\partial z} = 1 - \hat{y}^2$,  $\frac{\partial z}{\partial \theta_i} = x_i$
   $\Rightarrow\; \frac{\partial L}{\partial \theta_i} = x_i\,\frac{(\hat{y}_s - y)(1 - \hat{y}^2)}{2\hat{y}_s(1-\hat{y}_s)}$
Logistic Regression with Tanh (simplifying the gradient)

Substituting $\hat{y}_s = \frac{\hat{y} + 1}{2}$ and $1 - \hat{y}_s = \frac{1 - \hat{y}}{2}$:
   $\frac{\partial L}{\partial \theta_i} = x_i\,\frac{\left(\frac{\hat{y}+1}{2} - y\right)(1 - \hat{y}^2)}{2\,\frac{\hat{y}+1}{2}\left(1 - \frac{\hat{y}+1}{2}\right)} = x_i\,\frac{(\hat{y} + 1 - 2y)(1 - \hat{y}^2)}{(\hat{y} + 1)(1 - \hat{y})}$
and since $1 - \hat{y}^2 = (\hat{y} + 1)(1 - \hat{y})$,
   $\frac{\partial L}{\partial \theta_i} = x_i(\hat{y} + 1 - 2y)$
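A finite-difference check of this simplified gradient, $\frac{\partial L}{\partial \theta_i} = x_i(\hat{y} + 1 - 2y)$, using illustrative values:

```python
import numpy as np

x = np.array([1.0, 1.4, 0.2]); y = 1.0
theta = np.array([0.1, 0.5, -0.1])

def loss(t):
    y_s = (np.tanh(x @ t) + 1) / 2          # rescaled tanh output in (0, 1)
    return -y * np.log(y_s) - (1 - y) * np.log(1 - y_s)

i, h = 0, 1e-6
e = np.zeros_like(theta); e[i] = h
numeric = (loss(theta + e) - loss(theta - e)) / (2 * h)
analytic = x[i] * (np.tanh(x @ theta) + 1 - 2 * y)
print(numeric, analytic)   # approximately equal
```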
Summary

Sigmoid function: $y = \frac{1}{1 + e^{-x}}$, mapping $x \in (-\infty, +\infty)$ to $(0, 1)$.

Batch logistic regression:
1) Pick all the samples from the training data
2) Compute the outputs: $\boldsymbol{z} = \boldsymbol{x}\boldsymbol{\theta}$,  $\hat{\boldsymbol{y}} = \sigma(\boldsymbol{z}) = \frac{1}{1+e^{-\boldsymbol{z}}}$
3) Compute the loss (binary cross-entropy): $L(\boldsymbol{\theta}) = \frac{1}{N}\left[-\boldsymbol{y}^T\log\hat{\boldsymbol{y}} - (1-\boldsymbol{y})^T\log(1-\hat{\boldsymbol{y}})\right]$
4) Compute the derivative: $\nabla_{\boldsymbol{\theta}} L = \frac{1}{N}\boldsymbol{x}^T(\hat{\boldsymbol{y}} - \boldsymbol{y})$
5) Update the parameters: $\boldsymbol{\theta} = \boldsymbol{\theta} - \eta L'_{\boldsymbol{\theta}}$  ($\eta$ is the learning rate)
