ANN MPDM II – 2022/23
2. Backpropagation
i. Backpropagation is the algorithm that performs the backward pass:
i. It assesses the contribution of each operation to the loss of the network
ii. It uses partial derivatives and the chain rule for that:
downstream gradient = upstream gradient x local gradient
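To make the chain-rule step concrete, here is a minimal sketch in plain Python of the backward pass through a single multiplication node; all numeric values are hypothetical and chosen only for illustration.

```python
# Minimal sketch of one backward-pass step for a multiplication node z = x * w.
# Values are hypothetical; they only illustrate downstream = upstream * local.

x, w = 0.2, 0.3          # inputs to the node (forward pass)
z = x * w                # forward output

upstream = 1.5           # dLoss/dz received from the layer above (assumed value)

# Local gradients of z = x * w with respect to each input
local_dz_dx = w          # dz/dx
local_dz_dw = x          # dz/dw

# Chain rule: downstream gradient = upstream gradient * local gradient
dloss_dx = upstream * local_dz_dx   # gradient passed further down the network
dloss_dw = upstream * local_dz_dw   # gradient used to update the weight w

print(dloss_dx, dloss_dw)   # 0.45  0.3 with the hypothetical values above
```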
MLP Training – End of Example
Training sample:
ID  X1   X2   X3   X4   X5   y
1   0.2  0.8  0.3  0.0  0.7  1

Layer-1 weights and biases, before and after the update:
        w1_11  w1_12  w1_21  w1_22  w1_31  w1_32  w1_41  w1_42  w1_51  w1_52  wB1_01  wB1_02
Before  0.3    0.9    -0.2   0.1    0.3    0.9    -0.2   0.1    0.3    0.9    0.5     0.3
After   0.302  0.900  -0.192 0.099  0.303  0.900  -0.200 0.100  0.307  0.899  0.511   0.299
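As a quick sanity check on the table above, the small sketch below (plain Python, using only the two weight rows shown) computes the implied update Δw = w_after − w_before for each weight; the learning rate and the gradients that produced these updates are not shown in this excerpt.

```python
# Implied weight updates for the worked example: delta = w_after - w_before.
# Uses only the two rows shown above; learning rate / gradients are not given here.

names = ["w1_11", "w1_12", "w1_21", "w1_22", "w1_31", "w1_32",
         "w1_41", "w1_42", "w1_51", "w1_52", "wB1_01", "wB1_02"]
before = [0.3, 0.9, -0.2, 0.1, 0.3, 0.9, -0.2, 0.1, 0.3, 0.9, 0.5, 0.3]
after  = [0.302, 0.900, -0.192, 0.099, 0.303, 0.900, -0.200, 0.100, 0.307, 0.899, 0.511, 0.299]

for name, b, a in zip(names, before, after):
    print(f"{name}: delta = {a - b:+.3f}")   # e.g. w1_11: delta = +0.002
```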
AGENDA
1 Depth and Width of an ANN
2 Loss and Optimizers
3 Activation Functions
Toy Example:
2-class problem: separate the green dots from the red dots
MLP:
- Tanh activation function in the hidden layer
- SoftMax activation for the output layer (a minimal sklearn sketch follows below)
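A minimal sklearn sketch of this toy setup is shown below. The make_moons dataset is only a stand-in for the green/red dots (the actual data is not given here), and the hidden-layer size of 3 is an arbitrary example; MLPClassifier applies the classification output activation internally.

```python
# Sketch of the toy setup: 2-class data, tanh hidden layer, sklearn MLP.
# make_moons is only a stand-in for the green/red dots shown on the slide.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(3,),   # 1 hidden layer, 3 neurons (example)
                    activation="tanh",         # tanh in the hidden layer
                    max_iter=2000,
                    random_state=0)
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy
```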
1 Hidden Layer Width and Depth
Example:
- 1 hidden layer
- 1 neuron in the hidden layer
Example:
- 1 hidden layer
- 3 neurons in the hidden layer
1 Hidden Layer Width and Depth
Example:
- 1 hidden layer
- 5 neurons in the hidden layer
1 Hidden Layer Width and Depth
Example:
- 1 hidden layer
- 100 neurons in the hidden layer
1 Some observations
1. Neural networks with more neurons tend to represent more complicated functions
2. Models with more neurons tend to fit the training data better
3. However, they seem to do that at the expense of much more disjointed decision regions – possible overfitting (see the width sweep sketch below)
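The width effect described above can be explored with a short sweep like the following sketch, again using make_moons as a stand-in dataset; the exact accuracies will differ from the slides' plots.

```python
# Sketch: sweep the width of a single hidden layer and compare the training fit.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for width in (1, 3, 5, 100):
    clf = MLPClassifier(hidden_layer_sizes=(width,), activation="tanh",
                        max_iter=5000, random_state=0)
    clf.fit(X, y)
    print(f"{width:>3} hidden neurons -> training accuracy {clf.score(X, y):.3f}")
```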
Toy Example:
2-class problem: separate the green dots from the red dots
MLP:
- Tanh activation function in each hidden layer
- SoftMax activation for output layer
1 Hidden Layer Width and Depth
Example:
- 1 hidden layer
- 2 neurons in each hidden layer
1 Hidden Layer Width and Depth
Example:
- 2 hidden layers
- 2 neurons in each hidden layer
1 Hidden Layer Width and Depth
Example:
- 5 hidden layers
- 2 neurons in each hidden layer (see the depth sweep sketch below)
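A comparable depth sweep, keeping 2 neurons per hidden layer, might look like the sketch below (same stand-in dataset as before; results are only indicative).

```python
# Sketch: keep 2 neurons per hidden layer and increase the number of hidden layers.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for layers in ((2,), (2, 2), (2, 2, 2, 2, 2)):     # 1, 2 and 5 hidden layers
    clf = MLPClassifier(hidden_layer_sizes=layers, activation="tanh",
                        max_iter=5000, random_state=0)
    clf.fit(X, y)
    print(f"{len(layers)} hidden layer(s) -> training accuracy {clf.score(X, y):.3f}")
```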
1 Some observations
1. Increasing the number of layers and neurons makes the model more powerful:
i. A good starting point is having 2 hidden layers
ii. The number of nodes in each hidden layer should be smaller than the number of features
$Reg = \frac{\alpha}{2n}\,\lVert W \rVert_2^2$

$Err_j = \hat{y}_j\,(1 - \hat{y}_j)\sum_k Err_k\, w_{jk}$
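As a numeric illustration of the two formulas above, the sketch below computes the L2 penalty and the backpropagated error for made-up values of α, n, the weights and the downstream errors; none of these numbers come from the slides.

```python
# Sketch: L2 penalty Reg = alpha/(2n) * ||W||^2 and the backpropagated error
# Err_j = y_hat_j * (1 - y_hat_j) * sum_k(Err_k * w_jk). All values are made up.
import numpy as np

alpha, n = 0.0001, 100                     # regularization strength, number of samples
W = np.array([[0.3, 0.9], [-0.2, 0.1]])    # example weight matrix
reg = alpha / (2 * n) * np.sum(W ** 2)     # squared L2 norm of all weights

y_hat_j = 0.7                              # output of hidden unit j (sigmoid)
err_k = np.array([0.05, -0.02])            # errors of the units that j feeds into
w_jk = np.array([0.4, 0.6])                # weights from unit j to those units
err_j = y_hat_j * (1 - y_hat_j) * np.sum(err_k * w_jk)

print(reg, err_j)
```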
2 Loss and Optimizers
Weight adjustments:
1. Minimize loss function
2. Gradient descent
3. No guarantee of global minima
4. Possibility of becoming stuck in local optima
Need to define:
1. Optimization Algorithm
2. Step-size (Learning Rate) – see the gradient-descent sketch below
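A minimal gradient-descent step in the spirit of this list can be sketched as follows; the 1-D quadratic loss, the starting weight and the learning rate are all assumptions made for illustration.

```python
# Sketch: plain gradient descent on a 1-D quadratic loss L(w) = (w - 2)^2.
# Illustrates "minimize the loss by stepping against the gradient"; the loss,
# starting point and learning rate are made up for illustration.

def grad(w):
    return 2 * (w - 2)        # dL/dw for L(w) = (w - 2)^2

w = 0.0                       # initial weight
learning_rate = 0.1           # step size

for _ in range(50):
    w = w - learning_rate * grad(w)   # w <- w - lr * dL/dw

print(w)   # close to the minimum at w = 2
```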
i. Weight updates are made at every point or every few points
ii. Regarded as a good starting point due to its versatility
Mini-Batch Gradient Descent:
• By default, sklearn makes a weight update every 200 samples (batch_size = 'auto', i.e. min(200, n_samples))
- learning_rate_init – the initial value of the learning rate (0.01 is a good starting point)
- t – the number of iterations
- Adaptive
- The learning rate stays constant as long as the training loss keeps decreasing by at least a specific threshold
- In sklearn, if the loss does not decrease by at least that threshold (tol) for two consecutive epochs, the current learning rate is divided by 5 (see the MLPClassifier sketch below)
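In sklearn terms, the settings above map roughly onto MLPClassifier parameters as in the sketch below; the dataset and most values are placeholders, while batch_size, learning_rate_init, learning_rate='adaptive' and tol are the actual parameter names.

```python
# Sketch: mini-batch SGD with an adaptive learning rate in sklearn's MLPClassifier.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

clf = MLPClassifier(solver="sgd",
                    batch_size=200,            # mini-batch size (sklearn's default cap)
                    learning_rate_init=0.01,   # initial learning rate
                    learning_rate="adaptive",  # divide the LR by 5 when the loss stalls
                    tol=1e-4,                  # "does not decrease by at least tol"
                    max_iter=500,
                    random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```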
2 Chapter 2 – Main Takeaways
2. Practical Considerations:
i. Adam is flexible and seems to generalize rather well for most problems
ii. SGD with an appropriate Learning Rate and LR decay may outperform Adam (see the comparison sketch below)
iii. Fine-tuning is more experimental than mathematical
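As a hedged illustration of points (i) and (ii), the two solvers can be run side by side; this is only a sketch on placeholder data, not a claim about which optimizer wins in general.

```python
# Sketch: compare Adam with SGD + LR decay ('invscaling') on placeholder data.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

adam = MLPClassifier(solver="adam", learning_rate_init=0.001,
                     max_iter=500, random_state=0)
sgd = MLPClassifier(solver="sgd", learning_rate_init=0.01,
                    learning_rate="invscaling",  # LR decays with the iteration count t
                    max_iter=500, random_state=0)

for name, clf in (("adam", adam), ("sgd + decay", sgd)):
    clf.fit(X, y)
    print(f"{name}: training accuracy {clf.score(X, y):.3f}")
```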
3 Activation Functions
Multilayer Perceptron (MLP): a feedforward artificial neural network with multiple layers of perceptrons, capable of modelling complex non-linear relationships between inputs and outputs.
3 Activation Functions
Activation functions introduce non-linearity into the output of a neural network's neurons, which is
essential for the network to learn complex mappings between input and output data.
3 Activation Functions
Sigmoid
Main Attributes:
i. Squashes everything into the range [0, 1]
ii. Easy to interpret (the result can be read as a probability)
Potential Problems:
i. Not zero-centered
ii. Saturated neurons "kill" the gradients
iii. The exponential function may be computationally "expensive"
3 Activation Functions
Tanh
Main Attributes:
i. Squashes everything into the range [-1, 1]
ii. Zero-centered
Potential Problems:
i. Saturated neurons still "kill" the gradients
3 Activation Functions
ReLU
Main Attributes:
i. Does not saturate in the positive region
ii. Computationally efficient
iii. Converges faster than sigmoid/tanh in practice (e.g. ~6x)
iv. Popularized by AlexNet in 2012
Potential Problems:
i. Output is not zero-centered
ii. No gradient when x < 0 (see the activation sketch below)
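The saturation and dead-gradient behaviour described for these three activations can be seen directly from their derivatives; the sketch below uses a few arbitrary sample points.

```python
# Sketch: sigmoid, tanh and ReLU with their derivatives, showing where gradients vanish.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)            # ~0 for large |x| -> saturation "kills" the gradient

def d_tanh(x):
    return 1 - np.tanh(x) ** 2    # ~0 for large |x| -> still saturates (but zero-centered)

def relu(x):
    return np.maximum(0, x)

def d_relu(x):
    return (x > 0).astype(float)  # 1 for x > 0, 0 for x < 0 -> no gradient on the left

x = np.array([-10.0, -1.0, 0.5, 10.0])     # arbitrary sample points
print(d_sigmoid(x))   # tiny at -10 and 10
print(d_tanh(x))      # tiny at -10 and 10
print(d_relu(x))      # 0, 0, 1, 1
```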
3 Chapter 3 – Main Takeaways
1. Activation Functions:
i. Both Sigmoid and Tanh have the issue of possible saturation:
i. Inability to differentiate (the gradient vanishes) past a certain value of x
ii. ReLU does not saturate for positive values and is computationally very efficient
iii. It tends to converge rapidly
iv. However, it does not have a gradient for negative values
2. Practical Considerations:
i. Use ReLU as the activation function for hidden layers as a good starting point
ii. The viability of Sigmoid and Tanh is problem-specific and should be assessed during optimization
iii. Outside sklearn, there is a myriad of other ReLU-based activations worth exploring (see the Leaky ReLU sketch below)
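As one example of the ReLU-based variants mentioned in (iii), outside sklearn's built-in options, a Leaky ReLU can be written in a couple of lines; the 0.01 negative slope is a common but arbitrary choice, not taken from the slides.

```python
# Sketch: Leaky ReLU, a ReLU variant that keeps a small gradient for x < 0.
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # 0.01 is a common default slope, not a value taken from the slides
    return np.where(x > 0, x, negative_slope * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))   # [-0.02  0.    3.  ]
```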