Lecture 8 - Intro To Neural Networks

The document is a lecture outline for CS2109S on Neural Networks, covering topics such as perceptrons, their learning algorithms, and comparisons with other models. It discusses overfitting, regularization, and the kernel trick in SVMs, as well as the structure and function of neural networks. The content includes examples and illustrations to explain the concepts effectively.


CS2109S: Introduction to AI and Machine Learning

Lecture 8:
Intro to Neural Networks
19 March 2023

1
Recap
• Overfitting
• Regularization
• Linear and logistic regression
• Support Vector Machine (SVM)
• Hard-margin SVM
• Soft-margin SVM
• Kernel
• SVM with Kernel Trick

(Diagram: progression of models)
Logistic Regression / SVM, with x as features
→ Logistic Regression / SVM, with φ(x) as features
→ SVM with Kernel Trick, with φ(x) mapping to finite-dimensional features
→ SVM with Kernel Trick, with φ(x) mapping to infinite-dimensional features
2
Outline
• Perceptron
• Biological inspiration
• Perceptron Learning Algorithm
• Neural Networks
• Single-layer Neural Networks
• Multi-layer Neural Networks
• Regression and Classification
• Neural Networks with Gradient Descent
• Neural Networks vs Other Models

3
Outline
• Perceptron
• Biological inspiration
• Perceptron Learning Algorithm
• Neural Networks
• Single-layer Neural Networks
• Multi-layer Neural Networks
• Regression and Classification
• Neural Networks with Gradient Descent
• Neural Networks vs Other Models

4
Image credit:
https://kids.frontiersin.org/articles/10.3389/frym.2021.560631
https://www.upgrad.com/blog/biological-neural-network/

Brain and Neuron

5
Neuron

(Diagram: a model neuron with inputs x1, x2, x3, x4, weights w1, w2, w3, w4, a summation ∑, and an activation ∫ producing the output ŷ.)
6
Frank Rosenblatt (1957)

Perceptron

(Diagram: inputs 1 (bias), x1, x2, x3, x4 with weights w0, w1, w2, w3, w4; Sum ∑, Activation ∫, Output ŷ.)

$\hat{y} = h_w(x) = g\left(\sum_{i=0}^{n} w_i x_i\right)$

Sign function
$g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

7
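As a concrete illustration, here is a minimal sketch of this model in Python/NumPy. The function name and the bias-prepending convention are mine, not from the slides; the example weights (-3.5, 1) come from the cancer-size example on the following slides.

```python
import numpy as np

def perceptron_predict(w, x):
    """Perceptron forward pass: sign of the weighted sum.

    w includes the bias weight w0; x is prepended with the constant input 1.
    Returns +1 if the weighted sum is >= 0, else -1.
    """
    x = np.concatenate(([1.0], x))   # bias input x0 = 1
    z = np.dot(w, x)                 # weighted sum
    return 1 if z >= 0 else -1

# Cancer-size example with w = [-3.5, 1]:
print(perceptron_predict(np.array([-3.5, 1.0]), np.array([2.0])))  # -1 (benign)
print(perceptron_predict(np.array([-3.5, 1.0]), np.array([5.0])))  # +1 (malignant)
```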
Frank Rosenblatt (1957)

Perceptron: An Example

$\hat{y} = h_w(x) = g\left(\sum_{i=0}^{n} w_i x_i\right)$

(Diagram: the same perceptron with inputs 1, x1, ..., x4 and weights w0, ..., w4; Sum, Activation, Output.)

Sign function
$g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

Cancer prediction: benign (−1), malignant (+1).
(Plot: tumour examples along an axis of Size (cm²), from 1 to 6.)

8
Frank Rosenblatt (1957)

Perceptron: An Example

$\hat{y} = h_w(x) = g\left(\sum_{i=0}^{n} w_i x_i\right)$

With a single input x1 = Size (cm²):

$h_w(x) = \begin{cases} +1 & \text{if } w_0 + w_1 x \ge 0 \\ -1 & \text{otherwise} \end{cases}$

With $w_0 = -3.5$ and $w_1 = 1$:

$h_w(x) = \begin{cases} +1 & \text{if } -3.5 + 1 \cdot x \ge 0 \\ -1 & \text{otherwise} \end{cases}$

(Diagram: a one-input perceptron with inputs 1 and x1, weights w0 and w1; Sum, Activation, Output.)

Sign function
$g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

Cancer prediction: benign (−1), malignant (+1).
(Plot: Size (cm²) axis from 1 to 6.)

9
Frank Rosenblatt (1957)

Perceptron: An Example

$\hat{y} = h_w(x) = g\left(\sum_{i=0}^{n} w_i x_i\right)$

$h_w(x) = \begin{cases} +1 & \text{if } w_0 + w_1 x \ge 0 \\ -1 & \text{otherwise} \end{cases} = \begin{cases} +1 & \text{if } -3.5 + 1 \cdot x \ge 0 \\ -1 & \text{otherwise} \end{cases}$

Cancer prediction: benign (−1), malignant (+1).
(Plot: predictions over Size (cm²) from 1 to 6; the prediction jumps from −1 to +1 at the decision boundary x = 3.5.)

How do we learn $w$ while $g$ is not differentiable?


10
Frank Rosenblatt (1957)

Perceptron Learning Algorithm

• Initialize $w_i = 0$ for all $i$
• Loop (until convergence or max steps reached):
• For each instance $(x^{(i)}, y^{(i)})$, classify $\hat{y}^{(i)} = h_w(x^{(i)})$
• Select one misclassified instance $(x^{(j)}, y^{(j)})$
• Update weights: $w \leftarrow w + \gamma\,(y^{(j)} - \hat{y}^{(j)})\,x^{(j)}$, where $\gamma$ is the learning rate
11
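A minimal sketch of this loop in Python/NumPy. Assumptions (mine, not from the slides): X carries a leading bias column of 1s, labels are in {−1, +1}, and the default γ = 0.1 matches the worked example that follows; the function name is hypothetical.

```python
import numpy as np

def perceptron_learn(X, y, gamma=0.1, max_steps=1000):
    """Perceptron Learning Algorithm (sketch).

    X: (m, n+1) array whose first column is the bias input 1.
    y: (m,) array of labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])                      # initialize all w_i = 0
    for _ in range(max_steps):                    # loop until convergence or max steps
        y_hat = np.where(X @ w >= 0, 1, -1)       # classify every instance
        wrong = np.flatnonzero(y_hat != y)
        if wrong.size == 0:                       # converged: nothing misclassified
            return w
        j = wrong[0]                              # select one misclassified instance
        w = w + gamma * (y[j] - y_hat[j]) * X[j]  # w <- w + gamma (y - y_hat) x
    return w                                      # may not converge if not linearly separable
```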
Frank Rosenblatt (1957)

Perceptron Learning Algorithm: An Example

Sign function: $g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

$\hat{y} = h_w(x) = g(w_0 + w_1 x_1 + w_2 x_2) = g(0 + 1\,x_1 + 0.5\,x_2)$

(Plot: training points in the (x1, x2) plane with the current decision boundary; the point (2, −2) is misclassified.)

Update rule $w \leftarrow w + \gamma\,(y^{(i)} - \hat{y}^{(i)})\,x^{(i)}$, applied to the point $(2, -2)$ with $y = -1$, $\hat{y} = +1$, $\gamma = 0.1$:

$\begin{pmatrix} 0 \\ 1 \\ 0.5 \end{pmatrix} + 0.1\,(-1 - 1)\begin{pmatrix} 1 \\ 2 \\ -2 \end{pmatrix} = \begin{pmatrix} -0.2 \\ 0.6 \\ 0.9 \end{pmatrix}$

New $h_w(x) = g(-0.2 + 0.6\,x_1 + 0.9\,x_2)$


12
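This first update can be reproduced directly; a small check, not part of the original slides:

```python
import numpy as np

w = np.array([0.0, 1.0, 0.5])     # current weights (w0, w1, w2)
x = np.array([1.0, 2.0, -2.0])    # misclassified point (2, -2), with leading bias input 1
y, y_hat, gamma = -1, +1, 0.1
print(w + gamma * (y - y_hat) * x)   # [-0.2  0.6  0.9], matching the slide
```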
Frank Rosenblatt (1957)

Perceptron Learning Algorithm: An Example

Sign function: $g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

$\hat{y} = h_w(x) = g(w_0 + w_1 x_1 + w_2 x_2) = g(-0.2 + 0.6\,x_1 + 0.9\,x_2)$

(Plot: the point (−2, 2) is now misclassified.)

$w \leftarrow w + \gamma\,(y^{(i)} - \hat{y}^{(i)})\,x^{(i)}: \quad \begin{pmatrix} -0.2 \\ 0.6 \\ 0.9 \end{pmatrix} + 0.1\,(-1 - 1)\begin{pmatrix} 1 \\ -2 \\ 2 \end{pmatrix} = \begin{pmatrix} -0.4 \\ 1 \\ 0.5 \end{pmatrix}$

New $h_w(x) = g(-0.4 + 1\,x_1 + 0.5\,x_2)$


13
Frank Rosenblatt (1957)

Perceptron Learning Algorithm: An Example

Sign function: $g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

$\hat{y} = h_w(x) = g(w_0 + w_1 x_1 + w_2 x_2) = g(-0.4 + 1\,x_1 + 0.5\,x_2)$

(Plot: the point (2, −2) is misclassified again.)

$w \leftarrow w + \gamma\,(y^{(i)} - \hat{y}^{(i)})\,x^{(i)}: \quad \begin{pmatrix} -0.4 \\ 1 \\ 0.5 \end{pmatrix} + 0.1\,(-1 - 1)\begin{pmatrix} 1 \\ 2 \\ -2 \end{pmatrix} = \begin{pmatrix} -0.6 \\ 0.6 \\ 0.9 \end{pmatrix}$

New $h_w(x) = g(-0.6 + 0.6\,x_1 + 0.9\,x_2)$


14
Frank Rosenblatt (1957)

Perceptron Learning Algorithm: An Example

Sign function: $g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

$\hat{y} = h_w(x) = g(w_0 + w_1 x_1 + w_2 x_2) = g(-0.6 + 0.6\,x_1 + 0.9\,x_2)$

(Plot: the point (−2, 2) is misclassified.)

$w \leftarrow w + \gamma\,(y^{(i)} - \hat{y}^{(i)})\,x^{(i)}: \quad \begin{pmatrix} -0.6 \\ 0.6 \\ 0.9 \end{pmatrix} + 0.1\,(-1 - 1)\begin{pmatrix} 1 \\ -2 \\ 2 \end{pmatrix} = \begin{pmatrix} -0.8 \\ 1 \\ 0.5 \end{pmatrix}$

New $h_w(x) = g(-0.8 + 1\,x_1 + 0.5\,x_2)$


15
Frank Rosenblatt (1957)

Perceptron Learning Algorithm: An Example

Sign function: $g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

$\hat{y} = h_w(x) = g(w_0 + w_1 x_1 + w_2 x_2) = g(-0.8 + 1\,x_1 + 0.5\,x_2)$

(Plot: the point (2, −2) is misclassified.)

$w \leftarrow w + \gamma\,(y^{(i)} - \hat{y}^{(i)})\,x^{(i)}: \quad \begin{pmatrix} -0.8 \\ 1 \\ 0.5 \end{pmatrix} + 0.1\,(-1 - 1)\begin{pmatrix} 1 \\ 2 \\ -2 \end{pmatrix} = \begin{pmatrix} -1 \\ 0.6 \\ 0.9 \end{pmatrix}$

New $h_w(x) = g(-1 + 0.6\,x_1 + 0.9\,x_2)$


16
Frank Rosenblatt (1957)

Perceptron Learning Algorithm: An Example

Sign function: $g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

$\hat{y} = h_w(x) = g(w_0 + w_1 x_1 + w_2 x_2) = g(-1 + 0.6\,x_1 + 0.9\,x_2)$

No misclassifications! Converged!

What if the data is not linearly separable?
The algorithm will not converge.

17
Frank Rosenblatt (1957)

Perceptron Learning Algorithm: Why?

$\hat{y} = g\left(\sum_{i=0}^{n} w_i x_i\right), \qquad g(z) = \begin{cases} +1 & z \ge 0 \\ -1 & z < 0 \end{cases}$

When there is a misclassification: $y = +1, \hat{y} = -1$, or vice versa.

Case $\hat{y} = -1$ (but $y = +1$):
$w^T x < 0$, i.e. $w \cdot x = \|w\|\,\|x\|\cos\theta < 0$, so $\cos\theta < 0$ and $\frac{\pi}{2} < \theta < \pi$ (the positive magnitudes don't affect the sign; only $\cos\theta$ does).
What we want: $y = +1$, i.e. $\cos\theta > 0$ and $0 < \theta < \frac{\pi}{2}$.
Update $w' \leftarrow w + x$: the new angle $\theta'$ between $w'$ and $x$ will be smaller, as required.

Case $\hat{y} = +1$ (but $y = -1$):
$w^T x > 0$, i.e. $\cos\theta > 0$ and $0 < \theta < \frac{\pi}{2}$.
What we want: $y = -1$, i.e. $\cos\theta < 0$ and $\frac{\pi}{2} < \theta < \pi$.
Update $w' \leftarrow w - x$: the new angle $\theta'$ will be larger, as required.

(Diagram: the vectors $w$, $x$, $-x$ and the updated $w'$, with the angles $\theta$ and $\theta'$.)

18
Frank Rosenblatt (1957)

Perceptron Learning Algorithm: Why?

$\hat{y} = g\left(\sum_{i=0}^{n} w_i x_i\right), \qquad g(z) = \begin{cases} +1 & z \ge 0 \\ -1 & z < 0 \end{cases}$

When there is a misclassification: $y = +1, \hat{y} = -1$, or vice versa.

Case $\hat{y} = -1$, $y = +1$: we want $w' \leftarrow w + x$, so that $\theta'$ is smaller (as required). The update rule gives exactly this direction:
$w \leftarrow w + \gamma\,(y - \hat{y})\,x = w + \gamma\,(+1 - (-1))\,x = w + 2\gamma x$

Case $\hat{y} = +1$, $y = -1$: we want $w' \leftarrow w - x$, so that $\theta'$ is larger (as required):
$w \leftarrow w + \gamma\,(y - \hat{y})\,x = w + \gamma\,(-1 - (+1))\,x = w - 2\gamma x$

(Diagram: the vectors $w$, $x$, $-x$ and the updated $w'$, with the angles $\theta$ and $\theta'$.)

19
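A small supporting calculation, not on the original slides, makes the angle argument explicit:

```latex
% If y = +1 but \hat{y} = -1, the update is w' = w + 2\gamma x, so
w' \cdot x = (w + 2\gamma x) \cdot x = w \cdot x + 2\gamma \lVert x \rVert^2 \;>\; w \cdot x
% (for x \ne 0), i.e. \cos\theta' > \cos\theta and the angle between the weight vector
% and x shrinks. Symmetrically, w' = w - 2\gamma x gives
w' \cdot x = w \cdot x - 2\gamma \lVert x \rVert^2 \;<\; w \cdot x ,
% so the angle grows, as required.
```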
Outline
• Perceptron
• Biological inspiration
• Perceptron Learning Algorithm
• Neural Networks
• Single-layer Neural Networks
• Multi-layer Neural Networks
• Regression and Classification
• Neural Networks with Gradient Descent
• Neural Networks vs Other Models

20
Frank Rosenblatt (1957)

Perceptron

(Diagram: inputs 1, x1, ..., x4 with weights w0, ..., w4; Sum, Activation, Output ŷ.)

$\hat{y} = h_w(x) = g\left(\sum_{i=0}^{n} w_i x_i\right)$

Sign function
$g(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

21
Single-layer Neural Networks

(Diagram: the same structure as the perceptron; inputs 1, x1, ..., x4 with weights w0, ..., w4; Sum, Activation Function (usually non-linear), Output ŷ.)

$\hat{y} = h_w(x) = g\left(\sum_{i=0}^{n} w_i x_i\right)$

22-23
Single-layer Neural Networks

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$

(Diagram: inputs 1, x1, ..., x4 with weights w0, ..., w4; Sum, Sigmoid activation, Output ŷ.)

Sigmoid
$\sigma(z) = \dfrac{1}{1 + e^{-z}}$

24
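A minimal sketch of such a single sigmoid unit in Python/NumPy; the function names and bias convention are mine, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_layer_predict(w, x):
    """Single sigmoid unit: y_hat = sigma(sum_i w_i x_i), with bias input x0 = 1."""
    x = np.concatenate(([1.0], x))
    return sigmoid(np.dot(w, x))

print(single_layer_predict(np.array([10.0, -20.0]), np.array([0.0])))  # ~0.99995 (NOT gate, x1 = 0)
```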
Single-layer Neural Networks: NOT

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$, with $w_0 = 10$, $w_1 = -20$.

Consider $x_1 \in \{0, 1\}$:

x1   y    ∑     ŷ
0    1    10    0.999
1    0   −10    0.00004

Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

25
Single-layer Neural Networks: AND

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$, with $w_0 = -30$, $w_1 = 20$, $w_2 = 20$.

Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y    ∑     ŷ
0    0    0   −30    0.000…
0    1    0   −10    0.00004
1    0    0   −10    0.00004
1    1    1    10    0.999

Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

26
Single-layer Neural Networks: OR

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$, with $w_0 = -10$, $w_1 = 20$, $w_2 = 20$.

Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y    ∑     ŷ
0    0    0   −10    0.00004
0    1    1    10    0.999
1    0    1    10    0.999
1    1    1    30    0.999

Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

27
Single-layer Neural Networks: NOR, i.e. (NOT x1) AND (NOT x2)

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$, with $w_0 = 10$, $w_1 = -20$, $w_2 = -20$.

Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y    ∑     ŷ
0    0    1    10    0.999
0    1    0   −10    0.00004
1    0    0   −10    0.00004
1    1    0   −30    0.000…

Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

28
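The four gates above can be checked directly with the slide's weights; a small sketch (helper names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(w, x):
    # One sigmoid unit: w = [w0, w1, ...] (bias weight first), x = inputs without the bias
    return sigmoid(w[0] + np.dot(w[1:], x))

gates = {"AND": [-30, 20, 20], "OR": [-10, 20, 20], "NOR": [10, -20, -20]}   # slide weights
print("NOT:", [round(unit([10, -20], [x1]), 5) for x1 in (0, 1)])            # ~1, ~0
for name, w in gates.items():
    print(name, [round(unit(w, x), 5) for x in ((0, 0), (0, 1), (1, 0), (1, 1))])
```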
Single-layer Neural Networks: XNOR (eXclusive NOR)

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$, with $w_0 = {?}$, $w_1 = {?}$, $w_2 = {?}$.

Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y
0    0    1
0    1    0
1    0    0
1    1    1

(Plot: the four points in the (x1, x2) plane.) Not linearly separable!

29
Single-layer Neural Networks: XNOR (eXclusive NOR)

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$

Single-layer building blocks (each one sigmoid unit, bias weight first):
• AND: weights (−30, 20, 20)
• OR: weights (−10, 20, 20)
• NOR: weights (10, −20, −20)

Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y
0    0    1
0    1    0
1    0    0
1    1    1

Not linearly separable!

30
Multi-layer Neural Networks: XNOR (eXclusive NOR)

$\hat{y} = h_w(x) = \sigma\left(\sum_{i=0}^{n} w_i x_i\right)$ at every unit.

Network: hidden unit $a_1^{(2)}$ = AND(x1, x2), weights (−30, 20, 20); hidden unit $a_2^{(2)}$ = NOR(x1, x2), weights (10, −20, −20); output unit $a_1^{(3)} = \hat{y}$ = OR($a_1^{(2)}$, $a_2^{(2)}$), weights (−10, 20, 20).

Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y
0    0    1
0    1    0
1    0    0
1    1    1

Not linearly separable (in the original inputs x1, x2)!

31
Multi-layer Neural Networks: XNOR (eXclusive NOR)

$\hat{y} = \sigma\left(\sum_{i=0}^{n} w_{i1}^{(3)} a_i^{(2)}\right)$, where $a_i^{(2)} = \sigma\left(\sum_{j=0}^{n} w_{ji}^{(2)} x_j\right)$ and $a_0^{(2)} = 1$ (function composition).

Input Layer (1, x1, x2) → Hidden Layer ($a_1^{(2)}$ = AND, weights (−30, 20, 20); $a_2^{(2)}$ = NOR, weights (10, −20, −20)) → Output Layer ($\hat{y}$ = OR, weights (−10, 20, 20)).

Consider $x_1, x_2 \in \{0, 1\}$:

x1   x2   y    a1(2)   a2(2)   ŷ
0    0    1    0       1       1
0    1    0    0       0       0
1    0    0    0       0       0
1    1    1    1       0       1

XNOR is not linearly separable in (x1, x2), but the hidden layer maps the inputs to features in which it is.

32
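A self-contained sketch of this two-layer network with the slide's weights (the function name is mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xnor(x1, x2):
    """XNOR built from AND and NOR hidden units and an OR output unit."""
    a1 = sigmoid(-30 + 20 * x1 + 20 * x2)    # hidden unit 1: AND
    a2 = sigmoid( 10 - 20 * x1 - 20 * x2)    # hidden unit 2: NOR
    return sigmoid(-10 + 20 * a1 + 20 * a2)  # output: OR over the hidden activations

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(xnor(x1, x2), 4))  # ~1, ~0, ~0, ~1
```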
Multi-layer Neural Networks: |x − 1|

Target function: $|x - 1| = \max(0, x - 1) + \max(0, 1 - x)$.

Network: the input x feeds two hidden units
$a_1^{[1]} = g^{[1]}(-1 + 1 \cdot x) = \max(0, x - 1)$ (bias −1, weight +1)
$a_2^{[1]} = g^{[1]}(+1 - 1 \cdot x) = \max(0, 1 - x)$ (bias +1, weight −1)
Output: $\hat{y} = a_1^{[2]} = (+1)\,a_1^{[1]} + (+1)\,a_2^{[1]}$, with $g^{[2]}$ = None (linear output).

Which activation function(s)? $g^{[1]}$ = ReLU: $\max(0, x)$.

33
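A sketch of this small ReLU network in plain Python (names are mine, not from the slides):

```python
def relu(z):
    return max(0.0, z)

def abs_x_minus_1(x):
    """|x - 1| with one ReLU hidden layer: max(0, x - 1) + max(0, 1 - x)."""
    a1 = relu(-1.0 + 1.0 * x)    # hidden unit 1: max(0, x - 1)
    a2 = relu( 1.0 - 1.0 * x)    # hidden unit 2: max(0, 1 - x)
    return 1.0 * a1 + 1.0 * a2   # linear output (no activation)

print([abs_x_minus_1(x) for x in (-1, 0, 1, 2, 3)])  # [2.0, 1.0, 0.0, 1.0, 2.0]
```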
Multi-layer Neural Networks

(Diagram, built up step by step over slides 34-42: a general multi-layer network. Each layer $l$ has units $a_1^{(l)}, \ldots, a_{m_l}^{(l)}$ plus a constant bias unit 1; the inputs $x_1, \ldots, x_j, \ldots, x_n$ feed the first layer, and every unit in Layer $l$ takes all of Layer $l-1$'s outputs (and the bias) as inputs, continuing through Layer $l-1$, Layer $l$, Layer $l+1$ to the output.)

34-42
Multi-layer Neural Networks

Universal Function Approximation Theorem
Neural networks can represent a wide variety of interesting functions with appropriate weights.
• A single hidden layer network can approximate any continuous function within a specific range.

(Diagram: the same layered network as on the previous slides.)
43
Neural Networks and Matrix Multiplication (1)

(Diagram: 3 inputs x1, x2, x3 fully connected to 2 output units $\hat{y}_1, \hat{y}_2$; $W_{ij}$ is the weight from input $x_i$ to output unit $j$.)

$\boldsymbol{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \qquad W = \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \\ W_{31} & W_{32} \end{pmatrix}$

# Rows of W: one per input (number of weights per neuron / input variables)
# Columns of W: one per neuron in the layer (output variables)

$\hat{\boldsymbol{y}} = g(W^T \boldsymbol{x}) = g\!\left(\begin{pmatrix} W_{11} & W_{21} & W_{31} \\ W_{12} & W_{22} & W_{32} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}\right) = \begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \end{pmatrix}$


44
Neural Networks and Matrix Multiplication (2)

(Diagram: inputs x1, x2, x3 → a hidden layer of 3 units with weights $W^{[1]}$ → 2 output units $\hat{y}_1, \hat{y}_2$ with weights $W^{[2]}$.)

$\boldsymbol{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \qquad W^{[1]} = \begin{pmatrix} W_{11}^{[1]} & W_{12}^{[1]} & W_{13}^{[1]} \\ W_{21}^{[1]} & W_{22}^{[1]} & W_{23}^{[1]} \\ W_{31}^{[1]} & W_{32}^{[1]} & W_{33}^{[1]} \end{pmatrix}, \qquad W^{[2]} = \begin{pmatrix} W_{11}^{[2]} & W_{12}^{[2]} \\ W_{21}^{[2]} & W_{22}^{[2]} \\ W_{31}^{[2]} & W_{32}^{[2]} \end{pmatrix}$

$\hat{\boldsymbol{y}} = g^{[2]}\!\left( {W^{[2]}}^T \, g^{[1]}\!\left( {W^{[1]}}^T \boldsymbol{x} \right) \right) = \begin{pmatrix} \hat{y}_1 \\ \hat{y}_2 \end{pmatrix}$
45
Neural Networks and Matrix Multiplication (3)

(Diagram: inputs x1, ..., xj, ..., xn pass through layers with weight matrices $W^{[1]}, W^{[2]}, \ldots, W^{[L-1]}, W^{[L]}$ to outputs $\hat{y}_1, \ldots, \hat{y}_C$.)

Forward Propagation

$\hat{\boldsymbol{y}} = g^{[L]}\!\left({W^{[L]}}^T \cdots\, g^{[2]}\!\left({W^{[2]}}^T\, g^{[1]}\!\left({W^{[1]}}^T \boldsymbol{x}\right)\right) \cdots\right) = \begin{pmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_C \end{pmatrix}$
46
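A minimal sketch of forward propagation in this matrix form. Assumptions (mine, not from the slides): bias terms are omitted for brevity, and the numeric weights below are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, activations):
    """Forward propagation: y_hat = g[L](W[L]^T ... g[1](W[1]^T x)).

    weights: list of matrices; weights[l] has one row per input and one column per unit.
    activations: list of elementwise activation functions, one per layer.
    """
    a = x
    for W, g in zip(weights, activations):
        a = g(W.T @ a)   # z = W^T a, then a = g(z)
    return a

# Shapes matching slide 44 (3 inputs, 2 outputs); the values are made up:
W1 = np.array([[0.1, -0.2],
               [0.4,  0.3],
               [-0.5, 0.2]])
x = np.array([1.0, 2.0, 3.0])
print(forward(x, [W1], [sigmoid]))   # single layer: g(W^T x)
```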
Regression and Classification

Regression: a multi-layer network whose output layer has a single unit with linear/no activation, $g(x) = x$, so $\hat{y} \in \mathbb{R}$.

Binary Classification: the same architecture with a Sigmoid activation at the output unit, so $\hat{y} \in [0, 1]$.

(Diagram: two copies of the layered network, differing only in the output activation.)
47
Regression and Classification

Softmax activation
$g(z)_i = \dfrac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$

Regression: output layer with linear/no activation, $g(x) = x$, so $\hat{y} \in \mathbb{R}$.

Multi-class Classification with $C$ classes: output layer with $C$ units and a Softmax activation, so each $\hat{y}_i \in [0, 1]$ and $\sum_{i=1}^{C} \hat{y}_i = 1$.
48
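A minimal softmax sketch in NumPy; subtracting the maximum before exponentiating is a common implementation detail for numerical stability, not something from the slides:

```python
import numpy as np

def softmax(z):
    """Softmax: g(z)_i = exp(z_i) / sum_j exp(z_j); the outputs sum to 1."""
    z = z - np.max(z)      # shift for numerical stability (doesn't change the result)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx [0.659, 0.242, 0.099], summing to 1
```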
Outline
• Perceptron
• Biological inspiration
• Perceptron Learning Algorithm
• Neural Networks
• Single-layer Neural Networks
• Multi-layer Neural Networks
• Regression and Classification
• Neural Networks with Gradient Descent
• Neural Networks vs Other Models

49
Background: Chain Rule

$z(x) = h(g(f(x)))$

$\dfrac{dz}{dx} = \dfrac{dz}{dh} \cdot \dfrac{dh}{dg} \cdot \dfrac{dg}{df} \cdot \dfrac{df}{dx}$
50
Aside: Derivative of the Sigmoid Function

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

$\sigma'(x) = \dfrac{d}{dx}\left(\dfrac{1}{1 + e^{-x}}\right) = \dfrac{d}{dx}\left(1 + e^{-x}\right)^{-1}$
$= -\left(1 + e^{-x}\right)^{-2}\left(-e^{-x}\right)$
$= \dfrac{e^{-x}}{\left(1 + e^{-x}\right)^2}$
$= \dfrac{1}{1 + e^{-x}} \cdot \dfrac{e^{-x}}{1 + e^{-x}}$
$= \dfrac{1}{1 + e^{-x}}\left(1 - \dfrac{1}{1 + e^{-x}}\right)$
$= \sigma(x)\,(1 - \sigma(x))$

51
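A quick numerical sanity check of this identity (my own snippet, not from the slides), comparing it against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7
analytic = sigmoid(x) * (1 - sigmoid(x))               # sigma'(x) = sigma(x)(1 - sigma(x))
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # finite-difference estimate
print(analytic, numeric)                               # the two agree to high precision
```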
Gradient Descent on Single-layer NN

$\hat{y} = \sigma(f(x)), \qquad f(x) = \sum_{i=0}^{n} w_i x_i, \qquad L = \tfrac{1}{2}(\hat{y} - y)^2$ (squared error)

Gradient descent update: $w_i \leftarrow w_i - \gamma\,\dfrac{dL}{dw_i} = w_i - \gamma\,(\hat{y} - y)\,\hat{y}\,(1 - \hat{y})\,x_i$

By the chain rule: $\dfrac{dL}{dw_i} = \dfrac{dL}{d\hat{y}} \cdot \dfrac{d\hat{y}}{df} \cdot \dfrac{df}{dw_i} = (\hat{y} - y)\,\hat{y}\,(1 - \hat{y})\,x_i$, where

$\dfrac{dL}{d\hat{y}} = \dfrac{d}{d\hat{y}}\left(\tfrac{1}{2}(\hat{y} - y)^2\right) = \hat{y} - y$

$\dfrac{d\hat{y}}{df} = \dfrac{d\,\sigma(f(x))}{df} = \sigma(f(x))\left(1 - \sigma(f(x))\right) = \hat{y}\,(1 - \hat{y})$

$\dfrac{df}{dw_i} = \dfrac{d}{dw_i}\left(\sum_{k=0}^{n} w_k x_k\right) = \dfrac{d(w_i x_i)}{dw_i} + \dfrac{d}{dw_i}\left(\sum_{k \ne i} w_k x_k\right) = x_i$
52
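A one-step sketch of this update in NumPy; the data point and learning rate below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_step(w, x, y, gamma=0.1):
    """One gradient-descent step on a single sigmoid unit with squared-error loss.

    L = 1/2 (y_hat - y)^2,  dL/dw_i = (y_hat - y) * y_hat * (1 - y_hat) * x_i.
    x includes the bias input x0 = 1.
    """
    y_hat = sigmoid(np.dot(w, x))
    grad = (y_hat - y) * y_hat * (1 - y_hat) * x
    return w - gamma * grad

w = np.zeros(3)
x = np.array([1.0, 0.0, 1.0])    # [bias, x1, x2], illustrative
print(gd_step(w, x, y=1.0))      # weights move toward predicting y = 1
```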
Gradient Descent on Multi-layer NN
Backpropagation is covered in the next Lecture!

53
Outline
• Perceptron
• Biological inspiration
• Perceptron Learning Algorithm
• Neural Networks
• Single-layer Neural Networks
• Multi-layer Neural Networks
• Regression and Classification
• Neural Networks with Gradient Descent
• Neural Networks vs Other Models

54
Neural Networks vs Logistic Regression

(Diagram: inputs x1, ..., xj, ..., xn feed a single Sigmoid unit, i.e. a Linear Model.)

Linear, non-robust decision boundary: prone to misclassification since the decision boundary can be too close to the data points.

55
Neural Networks vs Logistic Regression: With Feature Mapping

(Diagram: inputs x1, ..., xj, ..., xn pass through a handcrafted feature mapping φ, then a single Sigmoid unit, i.e. a Linear Model on the mapped features.)

Non-linear, non-robust decision boundary: prone to misclassification since the decision boundary can be too close to the data points.

56
Neural Networks vs Support Vector Machines

(Diagram: inputs x1, ..., xj, ..., xn pass through a handcrafted feature mapping φ, implicit with a kernel (can have many, possibly infinite-dimensional features), then a Sign unit, i.e. a Linear Model on the mapped features.)

Non-linear, robust decision boundary: the decision boundary is guaranteed to be far from the data points.

57
Neural Networks vs Other Methods

Credit: Internet meme, original source unknown

(Diagram: inputs x1, ..., xj, ..., xn feed hidden layers with activation functions (usually non-linear), i.e. a Learned Feature Mapping, followed by a Linear Model at the output.)

Non-linear, non-robust decision boundary: prone to misclassification since the decision boundary can be too close to the data points.

58
Summary
• Perceptron
• Biological inspiration: brain, neural network, neuron
• Perceptron Learning Algorithm:
• $w \leftarrow w + \gamma\,(y^{(j)} - \hat{y}^{(j)})\,x^{(j)}$ on a misclassified instance
• Neural Networks
• Single-layer Neural Networks: AND, OR, NOR
• Multi-layer Neural Networks: XNOR, Universal Approximation Theorem
• Regression and Classification
• Neural Networks with Gradient Descent
• Neural Networks vs Other Models: learned feature mapping!
59
Coming Up Next Week
• Math revision
• Linear algebra: scalar, vector, matrix, and their operations
• Calculus: partial derivative, matrix calculus, chain rule
• Backpropagation
• Introduction to PyTorch
• Training neural networks with gradient descent

60
To Do
• Lecture Training 8
• +100 Free EXP
• +50 Early bird bonus
• PS6 is out today!
• May need some knowledge from the next lecture (Lecture 9)

61
