Ann MPDM
2022/23
Geoffrey Hinton
The AI Buzz
All of the algorithms used in the previous applications…
What the brain looks like to the naked eye, and what some neuronal pathways look like using diffusion spectrum imaging (DSI)
Credit: http://www.humanconnectomeproject.org/
The biological neuron
AGENDA
1 An historical introduction
Modelling the brain: the main inspiration for Deep Learning
1 An historical perspective
(Timeline: 1950–1960)
Training a perceptron (numerical example):
a) 4 instances (the AND problem)

   x1  x2  y
   0   0   0
   0   1   0
   1   0   0
   1   1   1

(For the first instance, x = (0, 0), the linear output is 0 < θ, so ŷ = 0 = y and no update is needed.)

Second instance (w1 = 0.9, w2 = 0.9, θ = 0.5, α = 0.5):
i.  Calculate the output of the linear section: w1·x1 + w2·x2 = 0.9 × 0 + 0.9 × 1 = 0.9
ii. Compare the result with θ: assign ŷ = 0 if lower and ŷ = 1 otherwise → ŷ = 1
    The instance is misclassified (y = 0), so the weights are updated with wi ← wi + α(y − ŷ)xi, giving w2 = 0.9 + 0.5 × (0 − 1) × 1 = 0.4.

Third instance (w1 = 0.9, w2 = 0.4, θ = 0.5, α = 0.5):
i.  Calculate the output of the linear section: w1·x1 + w2·x2 = 0.9 × 1 + 0.4 × 0 = 0.9
ii. Compare the result with θ: assign ŷ = 0 if lower and ŷ = 1 otherwise → ŷ = 1
    Misclassified again (y = 0), so w1 = 0.9 + 0.5 × (0 − 1) × 1 = 0.4.

Fourth instance (w1 = 0.4, w2 = 0.4, θ = 0.5, α = 0.5):
i.  Calculate the output of the linear section: w1·x1 + w2·x2 = 0.4 × 1 + 0.4 × 1 = 0.8
ii. Compare the result with θ: assign ŷ = 0 if lower and ŷ = 1 otherwise → ŷ = 1
    Correct (y = 1), so the weights stay at w1 = 0.4, w2 = 0.4.

Running through the data again with the updated weights (w1 = 0.4, w2 = 0.4, θ = 0.5, α = 0.5):

First instance:
i.  w1·x1 + w2·x2 = 0.4 × 0 + 0.4 × 0 = 0
ii. ŷ = 0 (correct)

Second instance:
i.  w1·x1 + w2·x2 = 0.4 × 0 + 0.4 × 1 = 0.4
ii. ŷ = 0 (correct)

Third instance:
i.  w1·x1 + w2·x2 = 0.4 × 1 + 0.4 × 0 = 0.4
ii. ŷ = 0 (correct)
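As a minimal Python sketch of the two steps above plus the update rule (the helper names predict and update are illustrative, not from the slides), reproducing the second instance of the AND example:

def predict(w1, w2, x1, x2, theta):
    # Linear section followed by the threshold: y_hat = 1 if the sum reaches theta
    z = w1 * x1 + w2 * x2
    return 1 if z >= theta else 0

def update(w, x, y, y_hat, alpha):
    # Perceptron rule: the correction alpha * (y - y_hat) * x is zero when the prediction is correct
    return w + alpha * (y - y_hat) * x

# Second instance of the AND example: x = (0, 1), y = 0
w1, w2, theta, alpha = 0.9, 0.9, 0.5, 0.5
y_hat = predict(w1, w2, 0, 1, theta)   # z = 0.9 -> y_hat = 1 (misclassified)
w1 = update(w1, 0, 0, y_hat, alpha)    # unchanged: 0.9
w2 = update(w2, 1, 0, y_hat, alpha)    # 0.9 + 0.5 * (0 - 1) * 1 = 0.4
print(y_hat, w1, w2)                   # 1 0.9 0.4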
McCulloch & Pitts (1943): networks of binary neurons can do logic
Minsky and Papert (1969): the limitations of the Perceptron
The same procedure on the XOR problem:

   x1  x2  y
   0   0   0
   0   1   1
   1   0   1
   1   1   0

First instance (w1 = 0.9, w2 = 0.9, θ = 0.5, α = 0.5):
i.  Calculate the output of the linear section: w1·x1 + w2·x2 = 0.9 × 0 + 0.9 × 0 = 0
ii. Compare the result with θ: assign ŷ = 0 if lower and ŷ = 1 otherwise → ŷ = 0 (correct)

Second instance (w1 = 0.9, w2 = 0.9):
i.  w1·x1 + w2·x2 = 0.9 × 0 + 0.9 × 1 = 0.9
ii. ŷ = 1 (correct)

Third instance (w1 = 0.9, w2 = 0.9):
i.  w1·x1 + w2·x2 = 0.9 × 1 + 0.9 × 0 = 0.9
ii. ŷ = 1 (correct)

Fourth instance (w1 = 0.9, w2 = 0.9):
i.  w1·x1 + w2·x2 = 0.9 × 1 + 0.9 × 1 = 1.8
ii. ŷ = 1, but y = 0: the instance is misclassified, so both weights are updated to w1 = w2 = 0.9 + 0.5 × (0 − 1) × 1 = 0.4.

Running through the data again with the updated weights (w1 = 0.4, w2 = 0.4):

First instance:
i.  w1·x1 + w2·x2 = 0.4 × 0 + 0.4 × 0 = 0
ii. ŷ = 0 (correct)

Second instance:
i.  w1·x1 + w2·x2 = 0.4 × 0 + 0.4 × 1 = 0.4
ii. ŷ = 0, but y = 1: misclassified again
The perceptron is able to solve the AND problem, but it is incapable of solving the XOR problem: the XOR classes are not linearly separable.
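To see both outcomes end to end, here is a small training-loop sketch under the same assumptions as above (θ = 0.5, α = 0.5, per-instance updates); the function name train_perceptron is illustrative:

def train_perceptron(data, w1=0.9, w2=0.9, theta=0.5, alpha=0.5, epochs=20):
    # Online perceptron training; returns the final weights and whether it converged.
    for _ in range(epochs):
        errors = 0
        for x1, x2, y in data:
            y_hat = 1 if w1 * x1 + w2 * x2 >= theta else 0
            if y_hat != y:
                errors += 1
                w1 += alpha * (y - y_hat) * x1
                w2 += alpha * (y - y_hat) * x2
        if errors == 0:            # an epoch with no mistakes: training has converged
            return (w1, w2), True
    return (w1, w2), False

AND_DATA = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
XOR_DATA = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

print(train_perceptron(AND_DATA))   # ((0.4, 0.4), True): the same weights the worked example reaches
print(train_perceptron(XOR_DATA))   # second element is False: no single threshold unit can represent XOR

On AND the loop converges to the same weights as the worked example; on XOR an error-free epoch never occurs, because no weight vector for a single linear threshold separates the two classes.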
1 An historical perspective
McCulloch & Pitts (1943): networks of binary neurons can do logic
Minsky and Papert (1969): the limitations of the Perceptron
To backpropagate the error:
1. Start backwards and compute the contribution of each operation to the result
2. Use the chain rule to compute the gradients

(Computational graph: x and w feed a multiplication node producing wx; adding the bias b gives z; applying f gives h.)

x – input vector
w – weight vector
b – bias of the linear section

z = wx + b
h = f(z)

Walking the graph backwards, the gradients ∂h/∂h, ∂h/∂z, ∂h/∂w and ∂h/∂b are obtained at the corresponding nodes.
1 Backpropagation
To backpropagate the error:
1. Start backwards and compute the contribution of each operation to the result
2. Use the chain rule to compute the gradients:
   i. Each operation node has a local gradient. Consider the example where z = wx: at the multiplication node, the local gradient ∂z/∂w is multiplied by the upstream gradient ∂h/∂z to obtain the downstream gradient ∂h/∂w.
1 Backpropagation
To backpropagate the error:
1. Start backwards and compute the contribution of each operation to the result
2. Use the chain rule to compute the gradients:
   i. Each operation node has a local gradient. For z = wx, the local gradients are ∂z/∂w = x and ∂z/∂x = w, so the chain rule gives:

      ∂h/∂x = ∂h/∂z × ∂z/∂x
      ∂h/∂w = ∂h/∂z × ∂z/∂w
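These products can be checked numerically. Below is a minimal Python sketch (not from the slides; the concrete values of w, x and b and the choice of the sigmoid for f are illustrative assumptions) that runs the graph z = wx + b, h = f(z) forwards and then multiplies upstream gradients by local gradients on the way back:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, x, b = 0.9, 0.5, 0.1

# Forward pass: compute each node of the graph
z = w * x + b            # z = 0.55
h = sigmoid(z)           # h = f(z)

# Backward pass: start from dh/dh = 1 and multiply by local gradients
dh_dh = 1.0
dh_dz = dh_dh * h * (1 - h)   # local gradient of the sigmoid: f'(z) = f(z)(1 - f(z))
dh_dw = dh_dz * x             # local gradient of z = w*x + b w.r.t. w is x
dh_dx = dh_dz * w             # ... w.r.t. x is w
dh_db = dh_dz * 1.0           # ... w.r.t. b is 1

print(dh_dw, dh_dx, dh_db)

Every downstream line is just Upstream gradient × Local gradient, which is the rule summarised in the takeaways below.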
1 Chapter 1 – Main Takeaways
2. Backpropagation:
i. Introduces the ability to use gradient-based learning
ii. A different logic for weight updates, driven by the gradients of the error
iii. Allows the use of non-linear activation functions, which lets the network model non-linear relationships
1 Chapter 1 – Main Takeaways
Core components of modern neural networks (that also exist in larger networks):
i. The forward pass starts from the current weights and inputs and computes the result of each operation
ii. The backward pass uses backpropagation to compute the gradients of the output with respect to the weights
iii. Backpropagation is the algorithm used to apply the chain rule along a computational graph:
    Downstream gradient = Upstream gradient × Local gradient
2 Multi-Layer Perceptron
Input Layer:
i. Introduces the inputs to the network
ii. No processing or activation function

Hidden Layers:
i. Take, as input, the outputs of the previous layer and pass their own outputs along to the next layer
ii. Two hidden layers are considered enough to handle most problems

Output Layer:
i. Generates the prediction using the outputs of the hidden layers as its inputs
ii. Backpropagation in an MLP is done for all weights, from the input layer to the output layer
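As a shape-level sketch of this structure (the 5-2-1 architecture and the random weights are illustrative assumptions, chosen to match the numeric example that follows):

import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_outputs = 5, 2, 1
W1 = rng.normal(size=(n_inputs, n_hidden))   # input layer -> hidden layer weights
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_hidden, n_outputs))  # hidden layer -> output layer weights
b2 = rng.normal(size=n_outputs)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.8, 0.3, 0.0, 0.7])      # input layer: no processing, just the inputs
a1 = sigmoid(x @ W1 + b1)                    # hidden layer outputs, shape (2,)
y_hat = sigmoid(a1 @ W2 + b2)                # output layer: the prediction, shape (1,)
print(y_hat)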
2 Multi-Layer Perceptron
Step 1 – Initialization:
i. Weights – random initialization
ii. Learning rate and how it evolves across iterations, e.g. α = 0.5, decaying by 0.05 per iteration (epoch)
iii. Setting the activation function
Quick note: in sklearn, the activation function of the output layer is already set for you
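For reference, a hedged sketch of how these Step 1 choices could map onto sklearn's MLPClassifier; note that sklearn has no built-in "decay by 0.05 per epoch" schedule, so the decaying 'invscaling' schedule is used here as an approximation:

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(2,),     # one hidden layer with 2 units, as in the example below
    activation='logistic',       # sigmoid activation for the hidden layer
    solver='sgd',                # plain (stochastic) gradient descent
    learning_rate_init=0.5,      # alpha = 0.5 from the slide
    learning_rate='invscaling',  # a decaying learning-rate schedule (only used with solver='sgd')
    random_state=42,             # weights are still randomly initialised, but reproducibly
)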
2 Training an MLP – Numeric Example
ID   X1   X2   X3   X4   X5   y
1    0.2  0.8  0.3  0.0  0.7  1

Step 2 – Forward Pass:

z1 = Σ_{i=1..n} wi1·xi + wB1_01 = 0.3 × 0.2 + (-0.2) × 0.8 + 0.3 × 0.3 + (-0.2) × 0.0 + 0.3 × 0.7 + 0.5 = 0.7
a11 = 1 / (1 + e^(-z1)) = 1 / (1 + e^(-0.7)) = 0.668

z2 = 0.9 × 0.2 + 0.1 × 0.8 + 0.9 × 0.3 + 0.1 × 0.0 + 0.9 × 0.7 + 0.3 = 1.46
a21 = 1 / (1 + e^(-1.46)) = 0.812

Hidden Layer Weights
w1_11  w1_12  w1_21  w1_22  w1_31  w1_32  w1_41  w1_42  w1_51  w1_52  wB1_01  wB1_02
0.3    0.9    -0.2   0.1    0.3    0.9    -0.2   0.1    0.3    0.9    0.5     0.3

Output Layer Weights
w2_11  w2_12  wB2_01
0.8    -0.2   -0.2
2 Training an MLP – Numeric Example
ID   X1   X2   X3   X4   X5   y
1    0.2  0.8  0.3  0.0  0.7  1

Step 2 – Forward Pass (output layer):

a11 = 0.668, a21 = 0.812
z = w2_11 × a11 + w2_12 × a21 + wB2_01 = 0.8 × 0.668 + (-0.2) × 0.812 + (-0.2) = 0.172
a12 = 1 / (1 + e^(-0.172)) = 0.543 ≠ y → we need to update the weights
2 Training an MLP – Numeric Example
w1_11 = w1_11 + (α × Err_a11 × x1) = 0.3 + 0.5 × 0.021 × 0.2 = 0.302

(The error terms come from backpropagating the output error with the sigmoid derivative: at the output, Err_a12 = a12 (1 − a12)(y − a12) ≈ 0.113, and at hidden unit 1, Err_a11 = a11 (1 − a11) × w2_11 × Err_a12 ≈ 0.021.)

After repeating the process for all the other weights, we would have:

Output Layer Weights
        w2_11   w2_12   wB2_01
before  0.8     -0.2    -0.2
after   0.838   -0.154  -0.143

Hidden Layer Weights
        w1_11  w1_12  w1_21   w1_22  w1_31  w1_32  w1_41   w1_42  w1_51  w1_52  wB1_01  wB1_02
before  0.3    0.9    -0.2    0.1    0.3    0.9    -0.2    0.1    0.3    0.9    0.5     0.3
after   0.302  0.900  -0.192  0.099  0.303  0.900  -0.200  0.100  0.307  0.899  0.511   0.299
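The slides show the update for w1_11 and the resulting tables; the NumPy sketch below reconstructs the whole forward and backward pass under the assumption of the sigmoid delta rule stated above, and reproduces the tables up to rounding:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.8, 0.3, 0.0, 0.7])
y = 1
alpha = 0.5

W1 = np.array([[0.3, 0.9],        # rows: inputs x1..x5, columns: hidden units 1 and 2
               [-0.2, 0.1],
               [0.3, 0.9],
               [-0.2, 0.1],
               [0.3, 0.9]])
b1 = np.array([0.5, 0.3])
W2 = np.array([0.8, -0.2])        # hidden units -> single output unit
b2 = -0.2

# Forward pass
z1 = x @ W1 + b1                  # [0.70, 1.46]
a1 = sigmoid(z1)                  # [0.668, 0.812]
z2 = a1 @ W2 + b2                 # 0.172
a2 = sigmoid(z2)                  # 0.543

# Backward pass (sigmoid delta rule)
err_out = a2 * (1 - a2) * (y - a2)       # ~0.113
err_hid = a1 * (1 - a1) * W2 * err_out   # ~[0.021, -0.003]

# Weight updates
W2 = W2 + alpha * err_out * a1           # ~[0.838, -0.154]
b2 = b2 + alpha * err_out                # ~-0.143
W1 = W1 + alpha * np.outer(x, err_hid)   # e.g. W1[0, 0] -> ~0.302
b1 = b1 + alpha * err_hid                # ~[0.510, 0.298]; the slide, using rounded error terms, shows 0.511 and 0.299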
2 Training an MLP with sklearn
from sklearn.neural_network import MLPClassifier

# Create the MLP with default settings, fit it on the training data, predict on the test data
mlp_model = MLPClassifier()
mlp_model.fit(X_train, y_train)
y_pred = mlp_model.predict(X_test)
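A self-contained usage sketch around the snippet above; the synthetic dataset, the train/test split and the hyperparameter values are illustrative assumptions, not part of the slides:

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary-classification data with 5 features, just for illustration
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

mlp_model = MLPClassifier(max_iter=1000, random_state=42)
mlp_model.fit(X_train, y_train)
y_pred = mlp_model.predict(X_test)
print(accuracy_score(y_test, y_pred))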
2 Chapter 2 – Main Takeaways
2. Training an MLP:
i. Forward pass -> propagate the inputs through the network, from the input layer to the output layer
ii. Compute the error (loss) at the output
iii. Backpropagate the error along the network, updating the weights as you go in order to minimize the error
We’ve barely scratched the surface of ANNs