Lecture 8: Intro to Neural Networks
19 March 2023
Recap: Logistic Regression / SVM
• Overfitting
• Regularization
• Linear and logistic regression
• Support Vector Machine (SVM)
  • Hard-margin SVM
  • Soft-margin SVM
  • Kernel

[Figure: model progression — Logistic Regression / SVM with x as features; Logistic Regression / SVM with 𝜙(x) as features (𝜙 mapping to finite-dimensional features); SVM with the kernel trick]
Outline
• Perceptron
• Biological inspiration
• Perceptron Learning Algorithm
• Neural Networks
• Single-layer Neural Networks
• Multi-layer Neural Networks
• Regression and Classification
• Neural Networks with Gradient Descent
• Neural Networks vs Other Models
Image credit:
https://kids.frontiersin.org/articles/10.3389/frym.2021.560631
https://www.upgrad.com/blog/biological-neural-network/
Neuron

[Figure: a single artificial neuron — inputs x1…x4 with weights w1…w4 feed a weighted sum ∑, which passes through an activation ∫ to produce the output ŷ]
Frank Rosenblatt (1957)
Perceptron

[Figure: inputs 1, x1…x4 with weights w0…w4 feed a sum ∑, then an activation ∫, producing the output ŷ — labelled Inputs, Weights, Sum, Activation, Output]

$\hat{y} = h_w(x) = g\left(\sum_{i=0}^{n} w_i x_i\right)$

Sign function:
$g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$
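A minimal sketch of this perceptron in code (my own illustration, not from the slides), assuming the bias is folded in as a constant input x0 = 1 and using the sign activation above:

```python
import numpy as np

def sign(z):
    # g(z) = +1 if z >= 0 else -1
    return 1 if z >= 0 else -1

def perceptron_predict(w, x):
    # w: weights [w0, w1, ..., wn]; x: features [x1, ..., xn]
    # Prepend the constant input 1 so w0 acts as the bias.
    x_with_bias = np.concatenate(([1.0], x))
    return sign(np.dot(w, x_with_bias))

# Example: the tumour-size perceptron from the next slides (w0 = -3.5, w1 = 1)
print(perceptron_predict(np.array([-3.5, 1.0]), np.array([5.0])))  # +1 (malignant)
print(perceptron_predict(np.array([-3.5, 1.0]), np.array([2.0])))  # -1 (benign)
```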
Frank Rosenblatt (1957)
Perceptron: An Example

$\hat{y} = h_w(x) = g\left(\sum_{i=0}^{n} w_i x_i\right)$

Cancer prediction from a single feature, tumour size (cm²): benign (−1), malignant (+1).

[Figure: labelled examples plotted along a size axis from 1 to 6 cm²]
Frank Rosenblatt (1957)
Perceptron: An Example

With one input $x_1$ = size (cm²) and a bias weight $w_0$:

$h_w(x) = \begin{cases} +1 & \text{if } w_0 + w_1 x \ge 0 \\ -1 & \text{otherwise} \end{cases}$

For example, with $w_0 = -3.5$ and $w_1 = 1$:

$h_w(x) = \begin{cases} +1 & \text{if } -3.5 + 1x \ge 0 \\ -1 & \text{otherwise} \end{cases}$

Cancer prediction: benign (−1), malignant (+1)

[Figure: the same size axis (1–6 cm²); the threshold sits at 3.5 cm²]
Frank Rosenblatt (1957)
Perceptron: An Example

[Figure: the prediction as a function of size — −1 (benign) below the threshold, +1 (malignant) above it, on the 1–6 cm² axis]

To find the weights, the Perceptron Learning Algorithm updates them using a learning rate γ (applied step by step in the example that follows):

$w \leftarrow w + \gamma\,(y - \hat{y})\,x$
Frank Rosenblatt (1957)

Running the perceptron learning algorithm on a two-feature example, the weights of $\hat{y} = h_w(x) = g(w_0 + w_1 x_1 + w_2 x_2)$ change after each misclassified point:

1. $g(0 + 1x_1 + 0.5x_2)$
2. $g(-0.2 + 0.6x_1 + 0.9x_2)$
3. $g(-0.4 + 1x_1 + 0.5x_2)$
4. $g(-0.6 + 0.6x_1 + 0.9x_2)$
5. $g(-0.8 + 1x_1 + 0.5x_2)$
6. $g(-1 + 0.6x_1 + 0.9x_2)$

No misclassifications! Converged!

[Figure: sequence of plots in the (x1, x2) plane showing the decision boundary after each update]
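A compact sketch of the perceptron learning algorithm behind this sequence, assuming a fixed learning rate γ and the update rule from the following slides (the toy dataset here is made up for illustration):

```python
import numpy as np

def sign(z):
    return 1 if z >= 0 else -1

def train_perceptron(X, y, gamma=0.1, epochs=100):
    # X: (m, n) feature matrix; y: labels in {-1, +1}.
    # Prepend a constant 1 column so w[0] is the bias w0.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        misclassified = 0
        for xi, yi in zip(Xb, y):
            y_hat = sign(np.dot(w, xi))
            if y_hat != yi:
                # Perceptron update: w <- w + gamma * (y - y_hat) * x
                w = w + gamma * (yi - y_hat) * xi
                misclassified += 1
        if misclassified == 0:   # converged: no misclassifications
            break
    return w

# Hypothetical linearly separable toy data
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.2, 0.3], [0.5, 0.1]])
y = np.array([+1, +1, -1, -1])
print(train_perceptron(X, y))
```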
Frank Rosenblatt (1957)

Why the update works, geometrically (θ is the angle between w and x; the norms $\lVert w\rVert\,\lVert x\rVert$ don't affect the sign):

Case 1: $\hat{y} = -1$ but $y = +1$.
$w^\top x < 0$, i.e. $\lVert w\rVert\lVert x\rVert\cos\theta < 0$, so $\cos\theta < 0$ and $\tfrac{\pi}{2} < \theta < \pi$.
What we want: $y = +1$, i.e. $\cos\theta > 0$ and $0 < \theta < \tfrac{\pi}{2}$.
Update $w' \leftarrow w + x$: the new angle $\theta'$ will be smaller (as required).

Case 2: $\hat{y} = +1$ but $y = -1$.
$w^\top x > 0$, so $\cos\theta > 0$ and $0 < \theta < \tfrac{\pi}{2}$.
What we want: $y = -1$, i.e. $\cos\theta < 0$ and $\tfrac{\pi}{2} < \theta < \pi$.
Update $w' \leftarrow w - x$: the new angle $\theta'$ will be larger (as required).

[Figure: the two cases drawn with w, x, −x, w′ and the angles θ, θ′]
Frank Rosenblatt (1957)

The geometric updates match the perceptron learning rule $w \leftarrow w + \gamma\,(y - \hat{y})\,x$, up to the factor $2\gamma$:

Case $\hat{y} = -1$, $y = +1$ (update $w' \leftarrow w + x$, $\theta'$ smaller, as required):
$w \leftarrow w + \gamma\,(y - \hat{y})\,x = w + \gamma\,\big(+1 - (-1)\big)\,x = w + 2\gamma x$

Case $\hat{y} = +1$, $y = -1$ (update $w' \leftarrow w - x$, $\theta'$ larger, as required):
$w \leftarrow w + \gamma\,(y - \hat{y})\,x = w + \gamma\,\big(-1 - (+1)\big)\,x = w - 2\gamma x$

[Figure: the same two geometric cases as the previous slide]
Outline
• Perceptron
• Biological inspiration
• Perceptron Learning Algorithm
• Neural Networks
• Single-layer Neural Networks
• Multi-layer Neural Networks
• Regression and Classification
• Neural Networks with Gradient Descent
• Neural Networks vs Other Models
Frank Rosenblatt (1957)
Perceptron

$\hat{y} = h_w(x) = g\left(\sum_{i=0}^{n} w_i x_i\right), \qquad
g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$ (sign function)

[Figure: the perceptron diagram — inputs, weights, sum, activation, output]
Single-layer Neural Networks

A single-layer neural network has the same structure as the perceptron, but the activation function can be any (usually non-linear) function:

$\hat{y} = h_w(x) = g\left(\sum_{i=0}^{n} w_i x_i\right)$

[Figure: inputs 1, x1…x4 with weights w0…w4, a sum ∑, an activation function (usually non-linear), and the output ŷ]
Single-layer Neural Networks

With the sigmoid as the activation function:

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$

[Figure: the same single-layer network with a sigmoid activation at the output]
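A minimal sketch of this sigmoid unit, assuming NumPy and the bias folded in as the first weight (variable names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def single_layer_forward(w, x):
    # w: [w0, w1, ..., wn]; x: [x1, ..., xn]; w0 is the bias weight.
    return sigmoid(np.dot(w, np.concatenate(([1.0], x))))
```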
Single-layer Neural Networks: NOT

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$ with $w_0 = 10$, $w_1 = -20$. Consider $x_1 \in \{0, 1\}$:

x1   y   ∑     ŷ
0    1   10    0.999…
1    0   −10   0.00004
Single-layer Neural Networks: AND

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$ with $w_0 = -30$, $w_1 = 20$, $w_2 = 20$. Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y   ∑     ŷ
0    0    0   −30   0.000…
0    1    0   −10   0.00004
1    0    0   −10   0.00004
1    1    1   10    0.999
Single-layer Neural Networks: OR

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$ with $w_0 = -10$, $w_1 = 20$, $w_2 = 20$. Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y   ∑     ŷ
0    0    0   −10   0.00004
0    1    1   10    0.999
1    0    1   10    0.999
1    1    1   30    0.999
Single-layer Neural Networks: NOR — (NOT x1) AND (NOT x2)

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$ with $w_0 = 10$, $w_1 = -20$, $w_2 = -20$. Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y   ∑     ŷ
0    0    1   10    0.999
0    1    0   −10   0.00004
1    0    0   −10   0.00004
1    1    0   −30   0.000…
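A quick sketch that checks these hand-picked weights against the truth tables (a verification aid of mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(w, inputs):
    # w = [w0, w1, ...]; inputs = [x1, ...]; returns sigma(w0 + w1*x1 + ...)
    return sigmoid(w[0] + np.dot(w[1:], inputs))

gates = {
    "NOT": [10, -20],
    "AND": [-30, 20, 20],
    "OR":  [-10, 20, 20],
    "NOR": [10, -20, -20],
}

for name, w in gates.items():
    n_inputs = len(w) - 1
    for x in np.ndindex(*([2] * n_inputs)):
        out = gate(np.array(w, dtype=float), np.array(x, dtype=float))
        print(name, x, round(out, 5))
```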
Single-layer Neural Networks: XNOR (eXclusive Not OR)

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$ with $w_0 = {?}$, $w_1 = {?}$, $w_2 = {?}$. Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y
0    0    1
0    1    0
1    0    0
1    1    1

[Figure: the four points plotted in the (x1, x2) plane]
Single-layer Neural Networks: XNOR (eXclusive Not OR)

XNOR can be built from the single-layer gates above:

• AND: weights (w0, w1, w2) = (−30, 20, 20)
• OR:  weights (w0, w1, w2) = (−10, 20, 20)
• NOR: weights (w0, w1, w2) = (10, −20, −20)

Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y
0    0    1
0    1    0
1    0    0
1    1    1

[Figure: the three small gate networks and the (x1, x2) plot of the four points]
Multi-layer Neural Networks: XNOR (eXclusive Not OR)

Combine the gates in two layers: a hidden layer computes $a_1^{(2)} = \text{AND}(x_1, x_2)$ and $a_2^{(2)} = \text{NOR}(x_1, x_2)$; the output layer computes $\hat{y} = a_1^{(3)} = \text{OR}(a_1^{(2)}, a_2^{(2)})$.

[Figure: network with inputs x1, x2; hidden units AND (weights −30, 20, 20) and NOR (weights 10, −20, −20); output unit OR (weights −10, 20, 20)]

Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y
0    0    1
0    1    0
1    0    0
1    1    1
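A sketch of this two-layer XNOR network using the gate weights above (the structure follows the slide; the code and names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(w, inputs):
    # One sigmoid unit: sigma(w0 + w . inputs)
    return sigmoid(w[0] + np.dot(w[1:], inputs))

AND = np.array([-30.0, 20.0, 20.0])
NOR = np.array([10.0, -20.0, -20.0])
OR  = np.array([-10.0, 20.0, 20.0])

def xnor(x1, x2):
    a1 = unit(AND, np.array([x1, x2]))   # hidden unit 1: AND
    a2 = unit(NOR, np.array([x1, x2]))   # hidden unit 2: NOR
    return unit(OR, np.array([a1, a2]))  # output unit: OR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2), 4))   # reproduces the XNOR truth table
```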
Multi-layer Neural Networks: XNOR (eXclusive Not OR)

[Figure: the XNOR network drawn with an Input Layer, a Hidden Layer, and an Output Layer, next to the (x1, x2) plot of the four points]

XNOR is not linearly separable, so no single-layer network can represent it; the hidden layer makes it possible.
Multi-layer Neural Networks: |x − 1|

[Figure: input x feeds two hidden units with max(0, ·) activation — $a_1^{[1]} = \max(0,\, x - 1)$ (input weight +1, bias −1) and $a_2^{[1]} = \max(0,\, 1 - x)$ (input weight −1, bias +1); the output unit $a_1^{[2]}$ sums them with weights +1, +1 and no activation ($g^{[2]}$ = None); plot of the resulting V-shaped function]

$\hat{y} = \max(0,\, x - 1) + \max(0,\, 1 - x) = |x - 1|$
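A short check that this network indeed computes |x − 1| (a worked case split, not on the slide):

```latex
\[
\hat{y}(x) = \max(0,\, x-1) + \max(0,\, 1-x) =
\begin{cases}
(x - 1) + 0 = x - 1 = |x-1| & \text{if } x \ge 1,\\[2pt]
0 + (1 - x) = 1 - x = |x-1| & \text{if } x < 1.
\end{cases}
\]
```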
Multi-layer Neural Networks

[Figure: a fully connected multi-layer network — inputs 1, x1, …, xn; successive layers each with a bias unit 1 and activations $a_1^{(l-1)}, \ldots, a_{m_{l-1}}^{(l-1)}$, then $a_1^{(l)}, \ldots, a_{m_l}^{(l)}$, then $a_1^{(l+1)}, \ldots, a_{m_{l+1}}^{(l+1)}$, and so on to the output]
Multi-layer Neural Networks

Universal Function Approximation Theorem:
Neural networks can represent a wide variety of interesting functions with appropriate weights.

[Figure: the same fully connected multi-layer network as before]
Neural Networks and Matrix Multiplication (1)

[Figure: inputs x1, x2, x3 fully connected to outputs ŷ1, ŷ2 with weights W11, W21, W31, W12, W22, W32]

With 3 inputs (the number of weights per neuron / input variables) and 2 output neurons:

$\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \qquad
W = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \\ W_{31} & W_{32} \end{bmatrix}$

$\hat{\boldsymbol{y}} = g\!\left(W^{\top}\boldsymbol{x}\right)
 = g\!\left(\begin{bmatrix} W_{11} & W_{21} & W_{31} \\ W_{12} & W_{22} & W_{32} \end{bmatrix}
   \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}\right)
 = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \end{bmatrix}$

[Figure: a deeper network with inputs x1, …, xn, weight matrices $W^{[1]}, W^{[2]}, \ldots$, and outputs ŷ1, …, ŷC]

Forward Propagation:

$\hat{\boldsymbol{y}} = g^{[L]}\!\left(W^{[L]\top} \cdots\, g^{[l]}\!\left(W^{[l]\top} \cdots\, g^{[2]}\!\left(W^{[2]\top}\, g^{[1]}\!\left(W^{[1]\top}\boldsymbol{x}\right)\right)\cdots\right)\right)$
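A sketch of forward propagation in this matrix form, assuming sigmoid activations at every layer and NumPy; bias terms are omitted as in the matrix form above, and the layer sizes are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, activations):
    # weights[l] has shape (inputs_of_layer_l, units_of_layer_l),
    # so each layer computes a = g(W^T a_prev), matching y_hat = g(W^T x).
    a = x
    for W, g in zip(weights, activations):
        a = g(W.T @ a)
    return a

rng = np.random.default_rng(0)
x = rng.normal(size=3)                      # 3 input features
W1 = rng.normal(size=(3, 4))                # layer 1: 3 -> 4 units
W2 = rng.normal(size=(4, 2))                # layer 2: 4 -> 2 outputs
y_hat = forward(x, [W1, W2], [sigmoid, sigmoid])
print(y_hat)
```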
Regression and Classification

The output layer determines the task:

• Regression: linear / no activation ($g(x) = x$), output $\hat{y} \in \mathbb{R}$
• Binary classification: sigmoid activation, output $\hat{y} \in [0, 1]$
• Classification with C classes: C outputs $\hat{y}_1, \ldots, \hat{y}_C$, each in $[0, 1]$

[Figure: four example networks — regression and classification variants with single and multiple outputs, differing only in their output layer]
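A small sketch of these output-layer choices; using softmax for the C-class case is my assumption (the slide only shows that each ŷi lies in [0, 1]):

```python
import numpy as np

def identity(z):          # regression: linear / no activation, y_hat in R
    return z

def sigmoid(z):           # binary classification: y_hat in [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):           # C-class classification: each y_hat_i in [0, 1], summing to 1
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(identity(z), sigmoid(z), softmax(z))
```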
Background: Chain Rule

$z(x) = h\big(g\big(f(x)\big)\big)
\quad\Longrightarrow\quad
\frac{dz}{dx} = \frac{dz}{dh}\,\frac{dh}{dg}\,\frac{dg}{df}\,\frac{df}{dx}$
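A concrete instance (my example, not from the slide): with $f(x) = 2x$, $g(u) = u^2$, $h(v) = \sin v$,

```latex
\[
z(x) = \sin\!\big((2x)^2\big), \qquad
\frac{dz}{dx}
= \underbrace{\cos\!\big((2x)^2\big)}_{dh/dg}\cdot
  \underbrace{2\,(2x)}_{dg/df}\cdot
  \underbrace{2}_{df/dx}
= 8x\cos(4x^2).
\]
```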
Aside: Derivative of the Sigmoid Function

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

$\begin{aligned}
\sigma'(x) &= \frac{d}{dx}\left(\frac{1}{1 + e^{-x}}\right)
            = \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} \\
           &= -\left(1 + e^{-x}\right)^{-2}\left(-e^{-x}\right)
            = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} \\
           &= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
            = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right) \\
           &= \sigma(x)\,\big(1 - \sigma(x)\big)
\end{aligned}$
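A quick numerical sanity check of this identity via finite differences (my addition, not from the slide):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # finite-difference derivative
analytic = sigmoid(x) * (1 - sigmoid(x))                      # sigma(x) * (1 - sigma(x))
print(numeric, analytic)   # the two values should agree closely
```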
Gradient Descent on Single-layer NN

$\hat{y} = \sigma\big(f(x)\big), \qquad f(x) = \sum_{i=0}^{n} w_i x_i, \qquad
L = \tfrac{1}{2}\left(\hat{y} - y\right)^2 \ \text{(mean-squared error)}, \qquad
w_i \leftarrow w_i - \gamma \frac{dL}{dw_i}$

By the chain rule:

$\frac{dL}{dw_i} = \frac{dL}{d\hat{y}}\,\frac{d\hat{y}}{df}\,\frac{df}{dw_i} = (\hat{y} - y)\,\hat{y}(1 - \hat{y})\,x_i$

where

$\frac{dL}{d\hat{y}} = \frac{d}{d\hat{y}}\left(\tfrac{1}{2}(\hat{y} - y)^2\right) = \hat{y} - y$

$\frac{d\hat{y}}{df} = \frac{d\,\sigma(f(x))}{df} = \sigma\big(f(x)\big)\Big(1 - \sigma\big(f(x)\big)\Big) = \hat{y}(1 - \hat{y})$

$\frac{df}{dw_i} = \frac{d\left(\sum_{k=0}^{n} w_k x_k\right)}{dw_i} = \frac{d(w_i x_i)}{dw_i} + \frac{d\left(\sum_{k\neq i} w_k x_k\right)}{dw_i} = x_i$

So the update is

$w_i \leftarrow w_i - \gamma\,(\hat{y} - y)\,\hat{y}(1 - \hat{y})\,x_i$

[Figure: the AND and OR single-layer networks from earlier]
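A sketch of this update rule in code, assuming per-example (stochastic) updates; using the AND truth table as toy data is my choice, prompted by the networks shown in the figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_single_layer(X, y, gamma=0.5, epochs=5000):
    # X: (m, n) features; y: targets in [0, 1]. Prepend 1 so w[0] is the bias w0.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            y_hat = sigmoid(np.dot(w, xi))
            # w_i <- w_i - gamma * (y_hat - y) * y_hat * (1 - y_hat) * x_i
            w -= gamma * (y_hat - yi) * y_hat * (1 - y_hat) * xi
    return w

# Toy data: the AND truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w = train_single_layer(X, y)
print(w, sigmoid(np.hstack([np.ones((4, 1)), X]) @ w))
```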
Gradient Descent on Multi-layer NN
Backpropagation is covered in the next Lecture!
Outline
• Perceptron
• Biological inspiration
• Perceptron Learning Algorithm
• Neural Networks
• Single-layer Neural Networks
• Multi-layer Neural Networks
• Regression and Classification
• Neural Networks with Gradient Descent
• Neural Networks vs Other Models
Neural Networks vs Logistic Regression

[Figure, top: a neural network — inputs x1, …, xn pass through hidden layers to a sigmoid output]
[Figure, bottom: logistic regression — a handcrafted feature mapping 𝜙 applied to x1, …, xn, followed by a linear model with a sigmoid output]

Handcrafted Feature Mapping + Linear Model:
Non-linear, non-robust decision boundary — prone to misclassification, since the decision boundary can be too close to data points.
Neural Networks vs Support Vector Machines

Implicit with the kernel: can have many, possibly infinite-dimensional features.

[Figure: an SVM — a handcrafted feature mapping 𝜙 applied to x1, …, xn, followed by a linear model with a sign output]

Handcrafted Feature Mapping + Linear Model:
Non-linear, robust decision boundary — the decision boundary is guaranteed to be far from the data points.
Neural Networks vs Other Methods

[Figure: a multi-layer neural network with inputs xj, …, xn and hidden layers]
To Do
• Lecture Training 8
• +100 Free EXP
• +50 Early bird bonus
• PS6 is out today!
• May need some knowledge from the next lecture (Lecture 9)