
Master in Information Management

2022/23

Artificial Neural Networks


Ricardo Santos
In the previous episode…
Main topics from last week

1. Training a Perceptron or a Multi-Layer Perceptron:


i. Starts with the Forward Pass
ii. Requires the computation of some form of error
iii. Uses the Backward Pass to go back through the network and adjust weights

2. Backpropagation:
   i. Is the algorithm that performs the backward pass:
      - It assesses the contribution of each operation to the loss of the network
      - It uses partial derivatives and the chain rule for that (illustrated in the sketch below):
        downstream gradient = upstream gradient x local gradient
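As a minimal numeric sketch of this rule (not from the original slides; the single sigmoid unit and squared-error loss are illustrative choices only):

```python
import numpy as np

# Toy backward pass through a single sigmoid unit: z = w*x + b, a = sigmoid(z)
x, w, b = 0.5, 0.8, -0.2
z = w * x + b
a = 1.0 / (1.0 + np.exp(-z))         # forward pass

upstream = a - 1.0                   # dLoss/da for a squared-error loss with target y = 1
local = a * (1.0 - a)                # da/dz, the local gradient of the sigmoid
downstream = upstream * local        # downstream = upstream x local (chain rule)

grad_w = downstream * x              # dLoss/dw
grad_b = downstream * 1.0            # dLoss/db
print(grad_w, grad_b)
```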
MLP Training – End of Example

ID   X1    X2    X3    X4    X5    y
1    0.2   0.8   0.3   0.0   0.7   1

Prediction after Forward Pass: $\hat{y} = 0.543$

Output Layer Weights (first row: before the update, second row: after the update)

w^2_11    w^2_12    w^2_B01
 0.800    -0.200    -0.200
 0.838    -0.154    -0.143

Hidden Layer Weights (first row: before the update, second row: after the update)

w^1_11   w^1_12   w^1_21   w^1_22   w^1_31   w^1_32   w^1_41   w^1_42   w^1_51   w^1_52   w^1_B01   w^1_B02
 0.300    0.900   -0.200    0.100    0.300    0.900   -0.200    0.100    0.300    0.900    0.500     0.300
 0.302    0.900   -0.192    0.099    0.303    0.900   -0.200    0.100    0.307    0.899    0.511     0.299
AGENDA

1. Depth and Width of an ANN
2. Loss and Optimizers
3. Activation Functions
4. Other Cool Structures and Applications


1

Depth and Width of an ANN


You can play with different Neural Network architectures here
1 Hidden Layer Width and Depth

Observation: What does increasing the number of neurons actually do?

Toy Example:
- 2 class problem: separate green dots from red dots

MLP:
- Tanh activation function in hidden layer
- SoftMax activation for output layer
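A rough sklearn sketch of this width experiment; the two-moons dataset is only a stand-in for the green/red toy data, and sklearn chooses the output activation (logistic/softmax) on its own, so this is an approximation of the slide's setup:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in for the green-vs-red toy problem
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One hidden layer with tanh units, varying its width
for width in (1, 3, 5, 100):
    mlp = MLPClassifier(hidden_layer_sizes=(width,), activation="tanh",
                        max_iter=2000, random_state=0)
    mlp.fit(X_tr, y_tr)
    print(width, "neurons -> test accuracy:", round(mlp.score(X_te, y_te), 3))
```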
1 Hidden Layer Width and Depth

Observation: What does increasing the number of neurons actually do?

Example:
- 1 hidden layer
- 1 neuron in the hidden layer

What happens as we start to increase the number of neurons in this hidden layer?
1 Hidden Layer Width and Depth

Observation: What does increasing the number of neurons actually do?

Example:
- 1 hidden layer
- 3 neurons in the hidden layer
1 Hidden Layer Width and Depth

Observation: What does increasing the number of neurons actually do?

Example:
- 1 hidden layer
- 5 neurons in the hidden layer
1 Hidden Layer Width and Depth

Observation: What does increasing the number of neurons actually do?

Example:
- 1 hidden layer
- 100 neurons in the hidden layer
1 Some observations

1. Neural networks with more neurons can represent more complicated functions
2. Models with more neurons tend to fit the training data better
3. However, they seem to do so at the expense of much more disjointed decision regions, a sign of
   possible overfitting

[Figure: decision boundaries obtained with 3, 5, and 100 neurons in the hidden layer]


1 Hidden Layer Width and Depth

Observation: What does increasing the number of layers, keeping neurons constant, actually do?

Toy Example:
2 class problem: separate green dots from red dots

MLP:
- Tanh activation function in each hidden layer
- SoftMax activation for output layer
1 Hidden Layer Width and Depth

Observation: What does increasing the number of layers, keeping neurons constant, actually do?

Example:
- 1 hidden layer
- 2 neurons in each hidden layer
1 Hidden Layer Width and Depth

Observation: What does increasing the number of layers, keeping neurons constant, actually do?

Example:
- 2 hidden layers
- 2 neurons in each hidden layer
1 Hidden Layer Width and Depth

Observation: What does increasing the number of layers, keeping neurons constant, actually do?

Example:
- 5 hidden layers
- 2 neurons in each hidden layer
1 Some observations

1. Deeper networks seem to outperform shallow single-layer networks


2. There seems to be a cap on what increasing the number of neurons in each layer can achieve

[Figure: decision boundaries obtained with 2, 3, and 5 hidden layers]


1 Some observations

1. Deeper networks seem to outperform shallow single-layer networks


2. There seems to be a cap on what increasing the number of neurons in each layer can achieve
3. Higher proneness to overfitting with a larger number of neurons per layer
4. Need to find a balance between the number of layers and the number of neurons per layer

[Figure: decision boundaries obtained with 2 layers and 2, 3, and 5 neurons per layer]
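The depth comparison can be sketched the same way by stacking identical hidden layers (again, make_moons is only a stand-in for the toy data):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1, 2 and 5 hidden layers, each with 2 tanh neurons
for depth in (1, 2, 5):
    mlp = MLPClassifier(hidden_layer_sizes=(2,) * depth, activation="tanh",
                        max_iter=2000, random_state=0)
    mlp.fit(X_tr, y_tr)
    print(depth, "hidden layers -> test accuracy:", round(mlp.score(X_te, y_te), 3))
```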
1 Chapter 1 – Main Takeaways

1. Increasing number of layers and neurons makes the model more powerful:
i. A good starting point is having 2 hidden layers
ii. A number of nodes in each hidden layer that is smaller than the number of features

2. Practical Observations from the Literature:
   i. Experiment using MNIST:
      - Create different ANNs with different sizes
      - Measure the mean and standard deviation of the test loss of these ANNs
1 Chapter 1 – Main Takeaways

1. Increasing number of layers and neurons makes the model more powerful:
i. A good starting point is having 2 hidden layers
ii. A number of nodes in each hidden layer that is smaller than the number of features

2. Practical Observations from Literature:


i. Smaller networks exhibit fewer minima that are easy to converge to, but these tend to have
   high loss and to be merely local
ii. Larger networks have many more possible solutions, most of them tending to result in low loss
iii. However, to reach these minima, a larger network tends to require a much larger amount of
   data, which makes smaller neural networks better suited for simple problems
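A minimal sketch of the rule of thumb above (two hidden layers, each narrower than the number of features); the breast-cancer dataset and the halving heuristic are assumptions made only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # 30 features, used here only as an example
n_features = X.shape[1]
width = max(2, n_features // 2)              # fewer nodes than features in each hidden layer

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(width, width), max_iter=2000, random_state=0),
)
print(model.fit(X, y).score(X, y))           # training accuracy, just to show the pipeline runs
```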
2

Loss and Optimizers


Gradient Descent and Variants
2 Loss and Optimizers

What were we calculating exactly?

On the Perceptron:
   error: $\varepsilon = y - \hat{y}$

On the MLP (sigmoid activation):
   Output layer: $Err_j = \hat{y}_j (1 - \hat{y}_j)(y_j - \hat{y}_j)$
   Hidden layer: $Err_j = \hat{y}_j (1 - \hat{y}_j) \sum_k Err_k \, w_{jk}$

Another way of thinking about it:
   i. Start from a dataset with predictors X and target y
   ii. Define a score function: does the output match the expected label?
   iii. Define a loss function: Cross-Entropy Loss
        $Loss(\hat{y}, y, W) = -\frac{1}{n} \sum_{i=0}^{n} \big( y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i) \big) + Reg$, with $Reg = \frac{\alpha}{2n} \lVert W \rVert_2^2$
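A minimal NumPy sketch of that loss (binary cross-entropy plus the L2 penalty); the example values and array shapes are made up:

```python
import numpy as np

def loss_with_reg(y, y_hat, weights, alpha=1e-4):
    """Cross-entropy loss plus the (alpha / 2n) * ||W||^2 regularization term."""
    n = len(y)
    eps = 1e-12                                   # avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    ce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    reg = alpha / (2 * n) * sum(np.sum(W ** 2) for W in weights)
    return ce + reg

y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.6, 0.543])
weights = [np.array([[0.3, 0.9], [-0.2, 0.1]]), np.array([[0.8], [-0.2]])]
print(loss_with_reg(y, y_hat, weights))
```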
2 Loss and Optimizers

Goal is to update W to minimize loss

Simple example: a two-variable loss landscape


2 Loss and Optimizers

Weight adjustments:
1. Minimize loss function
2. Gradient descent
3. No guarantee of reaching the global minimum
4. Possibility of becoming stuck in a local optimum

Need to define:
1. Optimization Algorithm
2. Step-size (Learning Rate)

Author: Angshuman Saha, http://web.archive.org/web/20091026213444/http://www.geocities.com/adotsaha/NNinExcel.html
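The underlying update rule can be sketched in a few lines of NumPy; the bowl-shaped two-variable loss below is only a stand-in for the landscape in the figure:

```python
import numpy as np

def loss(w):
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2    # toy two-variable loss landscape

def grad(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.array([3.0, 2.0])      # arbitrary starting weights
lr = 0.1                      # step size (learning rate)
for _ in range(50):
    w = w - lr * grad(w)      # move against the gradient to reduce the loss
print(w, loss(w))             # ends near the minimum at (1.0, -0.5)
```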


2 Optimization Algorithms

Stochastic Gradient Descent:
- Faster training, as weight updates can be made at every point or every few points

Mini-Batch Gradient Descent:
- sklearn makes weight updates using, by default, batches of 200 samples
  (batch_size='auto', i.e. min(200, n_samples))

Adam:
i. Adaptive optimizer with momentum
ii. Regarded as a good starting point due to its versatility

How loss evolves depending on the optimizer used in sklearn (Source)
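In sklearn, switching between these optimizers amounts to changing the solver (and, for mini-batches, batch_size); a rough sketch on the built-in digits dataset:

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

for solver in ("sgd", "adam"):
    # batch_size='auto' means min(200, n_samples) samples per weight update
    mlp = MLPClassifier(hidden_layer_sizes=(50,), solver=solver,
                        batch_size="auto", max_iter=300, random_state=0)
    mlp.fit(X, y)
    print(solver, "-> final training loss:", round(mlp.loss_, 4))
```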
2 The learning rate

What is the learning rate?


- Determines the size of the step taken during optimization in the direction that minimizes the loss function
- Problem dependent, as SGD does not know the actual error landscape
2 The learning rate

How does Learning Rate evolve over time? (SGD)


- Constant
- The learning rate is constant in all iterations
- Invscaling
  $LR = \frac{LR\_rate\_init}{pow(t, power\_t)}$

- 𝐿𝑅_𝑟𝑎𝑡𝑒_𝑖𝑛𝑖𝑡 – The initial value for the Learning rate (0.01 is a good starting point)
- t – number of iterations
- Adaptive
    - The learning rate is kept constant as long as the training loss keeps decreasing by at least a specific threshold
    - In sklearn, if the loss does not decrease by at least that threshold (tol) for two consecutive
      epochs, the current learning rate is divided by 5
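A small sketch of the invscaling formula and of how the schedule is selected in sklearn (only meaningful with solver='sgd'); the numbers are illustrative:

```python
from sklearn.neural_network import MLPClassifier

def invscaling_lr(lr_init, t, power_t=0.5):
    """Effective learning rate after t iterations under the invscaling schedule."""
    return lr_init / pow(t, power_t)

for t in (1, 10, 100, 1000):
    print(t, round(invscaling_lr(0.01, t), 5))

# Choosing the schedule in sklearn
mlp = MLPClassifier(solver="sgd", learning_rate="adaptive",
                    learning_rate_init=0.01, tol=1e-4)
```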
2 Chapter 2 – Main Takeaways

1. Loss function and Gradient Descent:


i. Loss, or error, is computed as a function that is meant to be minimized via gradient
   descent (or, better yet, one of the Gradient Descent variants that exist)
ii. There is no guarantee that optimization will ever reach a global minimum
iii. SGD and Adam are two alternative algorithms available in sklearn for weight adjustment

2. Practical Considerations:
i. Adam is flexible and seems to generalize rather well for most problems
ii. SGD with an appropriate Learning Rate and LR decay may outperform Adam
iii. Fine-tuning is more experimental than mathematical
3

Activation Functions
A feedforward artificial neural network with multiple layers of perceptrons,
capable of modelling complex non-linear relationships between inputs and outputs.
3 Activation Functions

Activation functions introduce non-linearity into the output of a neural network's neurons, which is
essential for the network to learn complex mappings between input and output data.
3 Activation Functions

Sigmoid
Main Attributes:
i. Squashes everything into the range [0,1]
ii. Easy to interpret (result is a probability)

Potential Problems
i. Not zero-centered
ii. Saturated neurons “kill” the gradients
iii. Exponential function may be computationally "expensive"
3 Activation Functions

Tanh
Main Attributes:
i. Squashes everything into the range [-1,1]
ii. Zero centered

Potential Problems
i. Saturated neurons still “kill” the gradients
3 Activation Functions

ReLU
Main Attributes:
i. Does not saturate in positive region
ii. Computationally efficient
iii. Faster convergence than sigmoid/tanh in practice (e.g. around 6x)
iv. Popularized by AlexNet in 2012

Potential Problems
i. Not zero-centered output
ii. No gradient when x < 0
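To make the saturation point concrete, a small NumPy sketch of the three activations' gradients (input values chosen only for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # ~0 for large |x|: saturated neurons "kill" the gradient

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2   # also saturates for large |x|

def d_relu(x):
    return (x > 0).astype(float)   # 1 for x > 0, exactly 0 for x < 0

xs = np.array([-10.0, -1.0, 0.5, 10.0])
print("sigmoid grad:", np.round(d_sigmoid(xs), 5))
print("tanh grad:   ", np.round(d_tanh(xs), 5))
print("relu grad:   ", d_relu(xs))
```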
3 Chapter 3 – Main Takeaways

1. Activation Functions:
i. Both Sigmoid and Tanh have the issue of possible saturation:
i. The gradient vanishes past a certain value of x
ii. ReLU does not saturate for positive values and is computationally very efficient
iii. Tends to converge rapidly
iv. But, it does not have a gradient for negative values

2. Practical Considerations:
i. Use ReLU as the activation function for hidden layers as a good starting point
ii. Sigmoid and Tanh viability is problem-specific and should be assessed during optimization
iii. Outside sklearn, there is a myriad of other ReLU-based activations worth exploring (see the sketch below)
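As one example of those ReLU variants (not available as a built-in sklearn activation), a leaky ReLU keeps a small gradient for negative inputs; the 0.01 slope is a common but arbitrary choice:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Like ReLU, but keeps a small slope for x < 0 instead of zeroing it out
    return np.where(x > 0, x, slope * x)

print(leaky_relu(np.array([-2.0, -0.5, 0.0, 3.0])))
```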
4

Other Cool Structures and Applications


CNN, RNN & GANs (outside the scope of this course)
4 Image Classification with CNNs
4 Sequences with RNNs
4 Sequences with RNNs

Depending on the application, RNN architectures may differ significantly


4 Sequences with RNNs
4 Generative Adversarial Networks

Image generation from text prompts (source)


References
- https://github.com/Atcold/NYU-DLSP21
- https://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes03-neuralnets.pdf
- https://towardsdatascience.com/understanding-backpropagation-abcc509ca9d0
- http://cs231n.stanford.edu/schedule.html
- http://cs231n.stanford.edu/slides/2023/lecture_7.pdf
- https://cs230.stanford.edu/syllabus/
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.

- For a good place to start with Deep Learning: http://introtodeeplearning.com/


Thank You!

Address: Campus de Campolide, 1070-312 Lisboa, Portugal


Tel: +351 213 828 610 | Fax: +351 213 828 611
