Introduction of Machine Learning
Different types of Functions
• Classification: Given options (classes), the function outputs the correct one.
  Example: spam filtering — f(an email) → Yes/No.
  Example: playing Go — f(a position on the board) → the next move. Each position on the board is a class (19 × 19 classes).
• Structured Learning: beyond Regression and Classification, create something with structure (an image, a document).
How to find a function? A Case Study: predicting the daily views of a YouTube channel (https://www.youtube.com/c/HungyiLeeNTU).
The function we want to find:
$y = f(\text{information about the channel})$, where $y$ is the no. of views on 2/26.
1. Function with Unknown Parameters
$y = b + w x_1$
Here $x_1$ is the no. of views on the previous day, and $b$ (bias) and $w$ (weight) are unknown parameters to be learned from data. As an example guess, $b = 0.5k$ and $w = 1$ give $y = 0.5k + 1 \cdot x_1$.
2. Define Loss from Training Data
• Loss is a function of the parameters: $L(b, w)$.
• Loss: how good a set of values is.
How good is $L(0.5k, 1)$, i.e. $y = 0.5k + 1 \cdot x_1$?
Training data: daily views from 2017/01/01 – 2020/12/31. For example, the views on 2017/01/01 are $x_1 = 4.8k$, so the prediction is $y = 0.5k + 1 \cdot 4.8k = 5.3k$; the true views on 01/02 (the label) are $\hat{y} = 4.9k$, giving the error $e_1 = |y - \hat{y}| = 0.4k$. Averaging over all days:
$L = \frac{1}{N} \sum_n e_n$
(Figure: error surface — a contour plot of $L$ over the $w$–$b$ plane; some regions have large $L$, others small.)
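As a minimal sketch of this step (with a made-up list of daily view counts, and the absolute error $|y - \hat{y}|$ as the per-day error, as above):

```python
# Minimal sketch: compute L(b, w) for y = b + w * x1 on a toy view series.
# The `views` numbers (in thousands, k) are made up for illustration.
views = [4.8, 4.9, 7.5, 3.4, 9.8, 5.3, 6.1]

def loss(b, w, series):
    """L = (1/N) * sum_n e_n, with e_n = |y - y_hat| (mean absolute error)."""
    errors = []
    for x1, y_hat in zip(series[:-1], series[1:]):
        y = b + w * x1                 # today's views predict tomorrow's
        errors.append(abs(y - y_hat))  # e_n = |y - y_hat|
    return sum(errors) / len(errors)

print(loss(0.5, 1.0, views))  # evaluates L(0.5k, 1)
```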
Gradient Descent
• (Randomly) pick an initial value $w^0$.
• Compute $\frac{\partial L}{\partial w}\Big|_{w = w^0}$. If the slope is negative, increase $w$; if positive, decrease $w$.
• Update: $w^1 \leftarrow w^0 - \eta \frac{\partial L}{\partial w}\Big|_{w = w^0}$, where the learning rate $\eta$ is a hyperparameter.
• Update $w$ iteratively.
Do local minima truly cause the problem? (The iteration can stop at a local minimum instead of the global minimum.)
(Source of image: http://chico386.pixnet.net/album/photo/171572850)
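A minimal sketch of this update rule in Python, assuming a toy loss $L(w) = (w - 1)^2$ purely so the example is self-contained:

```python
# Gradient-descent sketch for a single parameter w.
# A toy loss L(w) = (w - 1)^2 is assumed just to keep the example self-contained;
# its derivative is dL/dw = 2 * (w - 1).
eta = 0.1   # learning rate (a hyperparameter)
w = 5.0     # (randomly) picked initial value w^0

for step in range(20):
    grad = 2 * (w - 1)   # dL/dw evaluated at the current w
    w = w - eta * grad   # w <- w - eta * dL/dw

print(w)  # converges toward the minimum at w = 1
```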
3. Optimization
$w^*, b^* = \arg\min_{w, b} L$
• Compute $\frac{\partial L}{\partial w}\Big|_{w=w^0, b=b^0}$ and update $w^1 \leftarrow w^0 - \eta \frac{\partial L}{\partial w}\Big|_{w=w^0, b=b^0}$.
• Compute $\frac{\partial L}{\partial b}\Big|_{w=w^0, b=b^0}$ and update $b^1 \leftarrow b^0 - \eta \frac{\partial L}{\partial b}\Big|_{w=w^0, b=b^0}$.
On the error surface, each step moves by $(-\eta\,\partial L/\partial w,\; -\eta\,\partial L/\partial b)$.
Result: $w^* = 0.97$, $b^* = 0.1k$, $L(w^*, b^*) = 0.48k$.
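The same procedure over both parameters, sketched with numerical (finite-difference) gradients standing in for the analytic derivatives, and reusing `loss()` and `views` from the earlier sketch:

```python
# Sketch: gradient descent on L(b, w) with numerical gradients, reusing
# loss() and views from the earlier sketch. Step sizes are illustrative.
eta, eps = 1e-4, 1e-6
b, w = 0.0, 1.0   # initial values b^0, w^0

for step in range(1000):
    dL_db = (loss(b + eps, w, views) - loss(b - eps, w, views)) / (2 * eps)
    dL_dw = (loss(b, w + eps, views) - loss(b, w - eps, views)) / (2 * eps)
    b, w = b - eta * dL_db, w - eta * dL_dw   # move by (-eta dL/db, -eta dL/dw)

print(b, w)   # learned values playing the role of b*, w*
```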
Machine Learning is so simple ……
$y = b + w x_1$, with $w^* = 0.97$, $b^* = 0.1k$, $L(w^*, b^*) = 0.48k$.
Step 1: function with unknown parameters → Step 2: define loss from training data → Step 3: optimization.
Training vs. unseen data:
(Figure: predicted and real daily views (k) over 2021/01/01 – 2021/02/14.)
$y = b + w x_1$: 2017 – 2020 (training) $L = 0.48k$; 2021 (unseen) $L' = 0.58k$.
Using more past days as features:
$y = b + \sum_{j=1}^{7} w_j x_j$: 2017 – 2020 $L = 0.38k$; 2021 $L' = 0.49k$.
Learned values: $b = 0.05k$, $w_1^* = 0.79$, $w_2^* = -0.31$, $w_3^* = 0.12$, $w_4^* = -0.01$, $w_5^* = -0.10$, $w_6^* = 0.30$, $w_7^* = 0.18$.
$y = b + \sum_{j=1}^{28} w_j x_j$: 2017 – 2020 $L = 0.33k$; 2021 $L' = 0.46k$.
$y = b + \sum_{j=1}^{56} w_j x_j$: 2017 – 2020 $L = 0.32k$; 2021 $L' = 0.46k$.
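As a sketch, here is one way such "past $k$ days" features could be assembled from a daily series (the series and $k = 3$ are illustrative; the slide's models use $k = 7, 28, 56$, and least squares stands in for gradient descent):

```python
import numpy as np

# Sketch: build "views in the past k days" features from a daily series.
# The series is illustrative; k = 3 just keeps the toy series long enough.
series = np.array([4.8, 4.9, 7.5, 3.4, 9.8, 5.3, 6.1, 7.0, 6.6, 5.9])  # views (k)
k = 3

# Row t: views on days t-k .. t-1; target: views on day t.
X = np.array([series[t - k:t] for t in range(k, len(series))])
y = series[k:]

# Fit b and w_1..w_k by least squares (a stand-in for gradient descent here).
A = np.hstack([np.ones((len(X), 1)), X])   # prepend a column of 1s for b
params, *_ = np.linalg.lstsq(A, y, rcond=None)
b, w = params[0], params[1:]
print(b, w)
```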
Linear models are too simple … we need more sophisticated models.
(Figure: for a linear model, different $w$ changes the slope and different $b$ shifts the line — the output can only be a straight line in $x_1$.)
All piecewise linear curves = constant + sum of a set of hard-sigmoid pieces.
(Figure: the red piecewise-linear curve equals a constant plus a set of hard-sigmoid components.)
To have a good approximation, we need sufficient pieces.
How do we represent this hard-sigmoid function?
Sigmoid Function
$y = c\,\frac{1}{1 + e^{-(b + w x_1)}} = c\;\mathrm{sigmoid}(b + w x_1)$
• Different $w$: change slopes.
• Different $b$: shift left/right.
• Different $c$: change height.
red curve = sum of a set of sigmoids + constant:
$y = b + \sum_i c_i\,\mathrm{sigmoid}(b_i + w_i x_1)$
(Figure: sigmoid components 1, 2, 3 plus the constant 0 add up to the red curve.)
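A small sketch of this sum-of-sigmoids model, with made-up parameter values just to show how the pieces combine:

```python
import numpy as np

# Sketch of y = b + sum_i c_i * sigmoid(b_i + w_i * x1) for a single
# feature x1. Parameter values are made up to illustrate the shape.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

b   = 0.0
c   = np.array([1.0, -2.0, 1.5])    # heights c_i
b_i = np.array([0.0,  1.0, -1.0])   # shifts b_i
w_i = np.array([1.0,  2.0,  0.5])   # slopes w_i

def model(x1):
    return b + np.sum(c * sigmoid(b_i + w_i * x1))

for x1 in (-2.0, 0.0, 2.0):
    print(x1, model(x1))
```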
New Model: More Features
$y = b + w x_1 \;\longrightarrow\; y = b + \sum_i c_i\,\mathrm{sigmoid}(b_i + w_i x_1)$
$y = b + \sum_j w_j x_j \;\longrightarrow\; y = b + \sum_i c_i\,\mathrm{sigmoid}\Big(b_i + \sum_j w_{ij} x_j\Big)$
Here $j$ (e.g. $1, 2, 3$) indexes the features and $i$ (e.g. $1, 2, 3$) indexes the sigmoids.
$w_{ij}$: weight for $x_j$ in the $i$-th sigmoid.
$r_1 = b_1 + w_{11} x_1 + w_{12} x_2 + w_{13} x_3$
$r_2 = b_2 + w_{21} x_1 + w_{22} x_2 + w_{23} x_3$
$r_3 = b_3 + w_{31} x_1 + w_{32} x_2 + w_{33} x_3$
In matrix form: $\boldsymbol{r} = \boldsymbol{b} + W \boldsymbol{x}$.
Passing each $r_i$ through the sigmoid gives
$a_i = \mathrm{sigmoid}(r_i) = \frac{1}{1 + e^{-r_i}}$, i.e. $\boldsymbol{a} = \sigma(\boldsymbol{r})$.
The output combines the $a_i$ with weights $c_i$:
$y = b + \boldsymbol{c}^\top \boldsymbol{a}$, with $\boldsymbol{a} = \sigma(\boldsymbol{r})$ and $\boldsymbol{r} = \boldsymbol{b} + W \boldsymbol{x}$.
Putting everything together:
$y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W \boldsymbol{x})$
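The whole computation is a few lines of linear algebra. A sketch with illustrative parameter values:

```python
import numpy as np

# Sketch of the forward pass y = b + c^T sigmoid(b_vec + W x) with
# 3 sigmoids and 3 features. All parameter values are made up.
def sigmoid(r):
    return 1 / (1 + np.exp(-r))

W     = np.array([[0.5, -0.2,  0.1],
                  [0.3,  0.8, -0.5],
                  [-0.4, 0.1,  0.9]])   # w_ij: sigmoid i, feature j
b_vec = np.array([0.1, -0.3, 0.2])      # b_i
c     = np.array([1.0, -1.5, 0.7])      # c_i
b     = 0.05                            # scalar bias

x = np.array([4.8, 4.9, 7.5])           # feature vector

r = b_vec + W @ x    # r = b + W x
a = sigmoid(r)       # a = sigma(r)
y = b + c @ a        # y = b + c^T a
print(y)
```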
Function with unknown parameters
$y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W \boldsymbol{x})$, where $\boldsymbol{x}$ is the feature vector.
All the unknown parameters — the rows of $W$, $\boldsymbol{b}$, $\boldsymbol{c}^\top$, and $b$ — are collected into one long vector:
$\boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots \end{bmatrix}$
Back to ML Framework
$y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W \boldsymbol{x})$
• Loss is a function of the parameters: $L(\boldsymbol{\theta})$.
• Loss means how good a set of values is.
Given a set of values, feed in a feature $\boldsymbol{x}$, compare the output $y$ with the label $\hat{y}$ to get the error $e$, and average over the data:
$L = \frac{1}{N} \sum_n e_n$
Back to ML Framework
Optimization of New Model
$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L$, with $\boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots \end{bmatrix}$
• (Randomly) pick initial values $\boldsymbol{\theta}^0$.
• Compute the gradient $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^0) = \begin{bmatrix} \frac{\partial L}{\partial \theta_1}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \frac{\partial L}{\partial \theta_2}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \vdots \end{bmatrix}$
• Update: $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta \boldsymbol{g}$, i.e. $\theta_1^1 \leftarrow \theta_1^0 - \eta \frac{\partial L}{\partial \theta_1}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0}$, $\theta_2^1 \leftarrow \theta_2^0 - \eta \frac{\partial L}{\partial \theta_2}\big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0}$, …
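In practice $\boldsymbol{g}$ is computed automatically by a framework. A minimal PyTorch sketch of one update step, with random placeholder data and parameters:

```python
import torch

# Sketch: one step theta^1 <- theta^0 - eta * g, with g from autograd.
# Data and initial parameters are random placeholders.
torch.manual_seed(0)
X = torch.randn(10, 3)    # 10 examples, 3 features
y_hat = torch.randn(10)   # labels

W = torch.randn(3, 3, requires_grad=True)
b_vec = torch.zeros(3, requires_grad=True)
c = torch.randn(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

eta = 0.01
y = b + torch.sigmoid(X @ W.T + b_vec) @ c   # y = b + c^T sigmoid(b_vec + W x)
L = (y - y_hat).abs().mean()                 # L = (1/N) sum_n |y - y_hat|
L.backward()                                 # fills p.grad with g's entries
with torch.no_grad():
    for p in (W, b_vec, c, b):
        p -= eta * p.grad                    # theta <- theta - eta * g
```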
Rectified Linear Unit (ReLU)
$c\,\max(0,\, b + w x_1)$
(Figure: a hard sigmoid is the sum of two ReLUs, $c\,\max(0,\, b + w x_1) + c'\,\max(0,\, b' + w' x_1)$.)
Sigmoid → ReLU
$y = b + \sum_i c_i\,\mathrm{sigmoid}\Big(b_i + \sum_j w_{ij} x_j\Big) \;\longrightarrow\; y = b + \sum_{2i} c_i\,\max\Big(0,\, b_i + \sum_j w_{ij} x_j\Big)$
(The sum runs over $2i$ terms because it takes two ReLUs to build one hard sigmoid.) Sigmoid and ReLU are both called activation functions.
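Swapping the activation in the earlier forward-pass sketch is a one-line change (same illustrative parameter values):

```python
import numpy as np

# Sketch: the same forward pass with ReLU as the activation function.
# Parameters repeat the illustrative values from the sigmoid sketch above.
def relu(r):
    return np.maximum(0.0, r)

W     = np.array([[0.5, -0.2,  0.1],
                  [0.3,  0.8, -0.5],
                  [-0.4, 0.1,  0.9]])
b_vec = np.array([0.1, -0.3, 0.2])
c     = np.array([1.0, -1.5, 0.7])
b     = 0.05
x     = np.array([4.8, 4.9, 7.5])

y = b + c @ relu(b_vec + W @ x)   # y = b + c^T max(0, b_vec + W x)
print(y)
```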
$y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W \boldsymbol{x})$ can be extended by stacking: the outputs of one layer of activations become the inputs of the next,
$\boldsymbol{a} = \sigma(\boldsymbol{b} + W \boldsymbol{x}), \quad \boldsymbol{a}' = \sigma(\boldsymbol{b}' + W' \boldsymbol{a}), \quad \dots$
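A sketch of the stacked computation, with random illustrative weights:

```python
import numpy as np

# Sketch: stacking layers a' = sigma(b' + W' a) starting from the features x.
# Shapes and random values are illustrative; real weights would be learned.
def sigmoid(r):
    return 1 / (1 + np.exp(-r))

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 3)), rng.normal(size=3)) for _ in range(2)]

a = np.array([4.8, 4.9, 7.5])    # input features x
for W_l, b_l in layers:          # two hidden layers
    a = sigmoid(b_l + W_l @ a)   # a <- sigma(b_l + W_l a)

c, b_out = rng.normal(size=3), 0.1
y = b_out + c @ a                # output y = b + c^T a'
print(y)
```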
Experimental Results
• Loss for multiple hidden layers.
• 100 ReLU for each layer.
• Input features are the no. of views in the past 56 days.

              1 layer   2 layers   3 layers   4 layers
2017 – 2020   0.28k     0.18k      0.14k      0.10k
2021          0.43k     0.39k      0.38k      0.44k

(Figure: 3-layer model — red: real no. of views; blue: estimated no. of views (k), 2021/01/01 – 2021/02/14.)
Back to ML Framework
(Figure: the stacked model again — each sigmoid/ReLU unit in the diagram is called a Neuron.)
Deep networks kept getting deeper (image source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf):
• AlexNet (2012): 8 layers, 16.4% error
• VGG (2014): 19 layers, 7.3% error
• GoogleNet (2014): special structure, 6.7% error
• Residual Net (2015): 152 layers — compared in height to Taipei 101
Why don’t we go deeper?
• Loss for multiple hidden layers.
• 100 ReLU for each layer.
• Input features are the no. of views in the past 56 days.

              1 layer   2 layers   3 layers   4 layers
2017 – 2020   0.28k     0.18k      0.14k      0.10k
2021          0.43k     0.39k      0.38k      0.44k

Note that the 4-layer model has the lowest loss on the training data (2017 – 2020) but the highest on the unseen 2021 data.
https://youtu.be/Dr-WRlEFefw https://youtu.be/ibJpTrp5mcE