L10 Neural Network
Objectives
• Understand neural networks
• What are activation functions?
• How does backpropagation work in a neural network?
• Understand feedforward neural networks
Motivation for Neural Network
• Comprised of:
  • Neurons, which pass input values through functions and output the result
  • Weights, which carry values between neurons
• 3 main types of layers:
  • Input layer
  • Hidden layer(s)
  • Output layer
[Figure: a 3-layer neural net with 3 input units, 4 hidden units in each of the 1st and 2nd hidden layers, and 1 output unit.]
• Naming convention: an N-layer neural network has N−1 layers of hidden units and 1 output layer.
Basic Neuron Visualization
[Figure: data x₁, x₂, x₃ from the previous layer is carried by weights w₁, w₂, w₃ into the neuron, which performs some form of computation, transforms the input with an activation function, and outputs the transformed data.]
Mathematical Model of the Neuron in a Neural Network
• Each input xᵢ is multiplied by its weight wᵢ; the cell body sums the weighted inputs plus the bias b, and the activation function f is applied to produce the output:

  z = ∑ᵢ wᵢxᵢ + b = w₁x₁ + w₂x₂ + w₃x₃ + b

  output = f(z) = f(∑ᵢ wᵢxᵢ + b)

• Another form of single-node visualization: the McCulloch-Pitts model.
In Vector Notation
• Bias, b
• Activation function, f
• Net input, z = wᵀx + b (the same as ∑ᵢ wᵢxᵢ + b)
Example Neuron Computation
• Weights: w₁ = 2, w₂ = 3, w₃ = −1
• Inputs: x₂ = 0.2, x₃ = 0.3 (x₁ and the bias b as in the original figure)
• Net input: z = ∑ᵢ wᵢxᵢ + b = 2.6
• Neuron output (sigmoid activation): f(z) = 1 / (1 + e^(−2.6)) ≈ 0.93
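A minimal NumPy sketch of this single-neuron computation. The weights come from the slide; the input x₁ and the bias b are illustrative values chosen so the net input comes out to z = 2.6:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Weights from the slide; x1 and b are illustrative choices that reproduce z = 2.6.
w = np.array([2.0, 3.0, -1.0])
x = np.array([0.9, 0.2, 0.3])
b = 0.5

z = np.dot(w, x) + b      # net input: sum of weighted inputs plus bias
a = sigmoid(z)            # neuron output after the activation function

print(f"z = {z:.2f}")     # 2.60
print(f"f(z) = {a:.2f}")  # 0.93
```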
Feedforward Neural Network
[Figure: a fully connected feedforward network with an input layer (x₁, x₂, x₃), two hidden layers of sigmoid (σ) units, and an output layer producing ŷ₁, ŷ₂, ŷ₃.]
Feedforward Neural Network
• Weights between layers are represented by matrices W^(1), W^(2), W^(3).
• Net inputs z^(2), z^(3), z^(4): the sums of weighted inputs at each layer, before the activation function is applied.
• Activations a^(1), a^(2), a^(3), a^(4): the outputs of the neurons, passed on to the next layer.
[Same network figure, annotated with these quantities.]
Matrix representation of computation
• a^(2) = σ(z^(2)) = σ(x W^(1)), where W^(1) is a 3x4 matrix and z^(2) and a^(2) are 4-vectors (bias terms omitted here).
[Figure: the input layer (x₁, x₂, x₃) feeding the first hidden layer of four σ units.]
Matrix representation of computation
• For a single training instance (data point):
  • Input: vector x (a row vector of length 3)
  • Output: a prediction vector ŷ
• For n training instances the equations look the same:
  • Input: matrix X (an nx3 matrix, each row a single instance)
  • Output: an nx3 matrix, each row a single prediction
• Next step: adjust the weights to learn from data.
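A minimal NumPy sketch of the vectorized forward pass, assuming a 3-4-4-3 architecture like the one drawn above; the random weights are illustrative. The same code handles a single length-3 input and an nx3 batch, because the matrix equations are identical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Illustrative weight matrices for a 3-4-4-3 network: W1 is 3x4, W2 is 4x4, W3 is 4x3.
W1, W2, W3 = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
b1, b2, b3 = np.zeros(4), np.zeros(4), np.zeros(3)

def forward(X):
    """Forward pass; X may be a length-3 vector or an (n, 3) matrix of instances."""
    a1 = X                    # activations of the input layer
    z2 = a1 @ W1 + b1         # net input to the first hidden layer
    a2 = sigmoid(z2)
    z3 = a2 @ W2 + b2         # net input to the second hidden layer
    a3 = sigmoid(z3)
    z4 = a3 @ W3 + b3         # net input to the output layer
    return sigmoid(z4)        # predictions y_hat

print(forward(np.array([0.9, 0.2, 0.3])).shape)  # (3,)   one prediction per output unit
print(forward(rng.normal(size=(5, 3))).shape)    # (5, 3) one row per instance
```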
How to Train a Neural Net?
• Put in training inputs (feature vectors), get the output.
• Compare the output to the correct answers (labels): look at the loss function J.
• Adjust and repeat!
• Backpropagation tells us how to make a single adjustment using calculus.
How to Train a Neural Net?
• Using Gradient Descent!
1. Make prediction
2. Calculate Loss
3. Calculate gradient of the loss function w.r.t. parameters
4. Update parameters by taking a step in the opposite direction
5. Iterate
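A minimal sketch of these five steps on the simplest possible model, a one-parameter linear predictor fit with mean squared error (the data and learning rate are illustrative):

```python
import numpy as np

# Toy data generated from y = 3x plus noise (illustrative).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w = 0.0    # parameter to learn
lr = 0.1   # learning rate (step size)

for step in range(50):
    y_hat = w * x                          # 1. make prediction
    loss = np.mean((y_hat - y) ** 2)       # 2. calculate loss (mean squared error)
    grad = np.mean(2 * (y_hat - y) * x)    # 3. gradient of the loss w.r.t. w
    w -= lr * grad                         # 4. step in the opposite direction
                                           # 5. iterate

print(round(w, 2))   # close to 3.0
```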
Feedforward Neural Network
1. Pass in the input
2. Calculate each layer
3. Get the output ŷ
4. Evaluate: compare the predictions ŷ₁, ŷ₂, ŷ₃ to the labels y₁, y₂, y₃
[Figure: the network computing ŷ from x₁, x₂, x₃.]
How to Train a Neural Net?
• How could we change the weights to make our loss function lower?
• Think of the neural net as a function F: X → Y
• F is a complex computation involving many weights
• Given the structure, the weights “define” the function F (and therefore define our model)
• The loss function is J(y, F(x))
How to Train a Neural Net?
• Get ∂J/∂w for every weight in the network.
• This tells us in what direction to adjust each weight if we want to lower our loss function.
• Make an adjustment and repeat!
How to Train a Neural Net?
• Want ∂J(y, ŷ)/∂W^(1), ∂J(y, ŷ)/∂W^(2), ∂J(y, ŷ)/∂W^(3)
[Figure: the network with weight matrices W^(1), W^(2), W^(3); predictions ŷ₁, ŷ₂, ŷ₃ are compared to labels y₁, y₂, y₃.]
How to Train a Neural Net?
• Get ∂J/∂w for every weight in the network
• Use calculus: the chain rule, etc.
• The functions are chosen to have “nice” derivatives
• Numerical issues need to be considered
How to Train a Neural Net?
• Recall the forms of the activation functions and their derivatives (see below).
• Though they appear complex, these derivatives are easy to compute!
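For example, for the sigmoid and tanh activations the derivatives have compact closed forms:

```latex
\sigma(z) = \frac{1}{1+e^{-z}}, \qquad
\sigma'(z) = \sigma(z)\,\bigl(1-\sigma(z)\bigr), \qquad
\tanh'(z) = 1-\tanh^{2}(z)
```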
Backpropagation
• Want ∂J(yᵢ, ŷᵢ)/∂W^(1), ∂J(yᵢ, ŷᵢ)/∂W^(2), ∂J(yᵢ, ŷᵢ)/∂W^(3)
• Update the parameters by taking a step in the opposite direction of the gradient.
[Figure: the gradients flow backwards through the network, from the loss at the output to each weight matrix.]
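A minimal NumPy sketch of backpropagation and the gradient-descent update for a network with one hidden layer of sigmoid units, a sigmoid output, and squared-error loss. The layer sizes, toy data, and learning rate are illustrative, not the exact network from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                    # 8 instances, 3 features (illustrative)
Y = (X.sum(axis=1, keepdims=True) > 0) * 1.0   # toy binary targets

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> output
lr = 0.5

for epoch in range(1000):
    # Forward pass
    z2 = X @ W1 + b1
    a2 = sigmoid(z2)
    z3 = a2 @ W2 + b2
    y_hat = sigmoid(z3)
    loss = np.mean((y_hat - Y) ** 2)

    # Backward pass: chain rule, layer by layer (gradients of the mean squared error)
    d_z3 = 2 * (y_hat - Y) * y_hat * (1 - y_hat) / len(X)   # dJ/dz3
    d_W2 = a2.T @ d_z3                                       # dJ/dW2
    d_b2 = d_z3.sum(axis=0)
    d_z2 = (d_z3 @ W2.T) * a2 * (1 - a2)                     # dJ/dz2
    d_W1 = X.T @ d_z2                                        # dJ/dW1
    d_b1 = d_z2.sum(axis=0)

    # Gradient descent step: move opposite to the gradient
    W2 -= lr * d_W2
    b2 -= lr * d_b2
    W1 -= lr * d_W1
    b1 -= lr * d_b1

print(round(loss, 4))   # the loss decreases over training
```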
Activation Functions – Sigmoid Function
• The “sigmoid” function σ(z) = 1 / (1 + e^(−z)); its curve looks like an S-shape.
• The sigmoid outputs a value between 0 and 1.
• Used for models where we have to predict a probability as the output (0 to 1).
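A one-function NumPy sketch of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    """Maps any real input into (0, 1), so it can be read as a probability."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-2.0, 0.0, 2.6])))   # [0.119 0.5 0.931]
```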
Activation Functions – Softmax Function

  softmax(z)ⱼ = e^(zⱼ) / ∑ₖ e^(zₖ),  for j = 1, …, k
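A NumPy sketch of the softmax; subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """Turns a vector of k scores into k probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())   # [0.659 0.242 0.099] 1.0
```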
Activation Functions – Hyperbolic Tangent (tanh) Function
• Often works better than the sigmoid function.
• Negative inputs are mapped strongly negative and zero inputs are mapped near zero in the tanh graph.
• Mainly used for classification between two classes.
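tanh is available directly in NumPy; unlike the sigmoid, its output lies in (−1, 1) and is centered at zero:

```python
import numpy as np

z = np.array([-2.0, 0.0, 2.0])
print(np.tanh(z))   # [-0.964  0.     0.964]  negative inputs map strongly negative
```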
Activation Functions – Rectified Linear Unit (ReLU)
• The most used activation function in the world right now.
• Used in almost all convolutional neural networks and deep learning models.
• Half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.
• Range: [0, ∞)
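A one-function NumPy sketch of ReLU:

```python
import numpy as np

def relu(z):
    """Zero for z < 0, identity for z >= 0; range [0, inf)."""
    return np.maximum(0.0, z)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0])))   # [0. 0. 0. 2.]
```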
Activation Functions – “Leaky” Rectified Linear Unit (Leaky ReLU)
• The leak helps to increase the range of the ReLU function; the slope α for negative inputs is 0.01 or so.
• Range of the Leaky ReLU: (−∞, ∞)
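A one-function NumPy sketch of the Leaky ReLU, with the leak slope α as a parameter:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but lets a small slope (alpha * z) through for z < 0."""
    return np.where(z >= 0, z, alpha * z)

print(leaky_relu(np.array([-3.0, -0.5, 0.0, 2.0])))   # [-0.03 -0.005 0. 2.]
```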
Choice of Activation Functions
• Sigmoid functions and their combinations generally work better in the case of classifiers.
• Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem.
• The ReLU function is a general activation function and is used in most cases these days.
• If we encounter a case of dead neurons in our network, the leaky ReLU function is the best choice.
• Always keep in mind that the ReLU function should only be used in the hidden layers.
• As a rule of thumb, you can begin with the ReLU function and then move to other activation functions if ReLU doesn’t give optimal results.
Activation Functions
• Each neural network had
three hidden layers with
three units in each one.
• The only difference was
the activation function.
• Learning rate: 0.03,
regularization: L2.
Dropout overview
• Neural networks can represent extremely complex data
• A very large number of parameters allows NNs to memorize a dataset (overfit), which motivates regularization techniques such as dropout
Dropout model
Dropout layer
[Figure: “gates” between layers that randomly block (drop) individual neurons’ outputs.]
Knocking out and rescaling neurons
• During training, randomly knock out (zero) some neurons and rescale the remaining activations.
• This ensures that the expected value of a neuron’s output stays the same at run time.
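A minimal NumPy sketch of this knock-out-and-rescale step (often called inverted dropout): during training each activation is kept with probability keep_prob and the survivors are divided by keep_prob, so the expected value seen by the next layer is unchanged; at run time the layer is left untouched. The keep probability and activations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, keep_prob=0.8, training=True):
    """Randomly zero activations during training and rescale the survivors."""
    if not training:
        return a                               # no dropout at run time
    mask = rng.random(a.shape) < keep_prob     # the "gates": True = keep, False = knock out
    return a * mask / keep_prob                # rescale so the expected output is unchanged

a = np.ones((2, 5))                  # illustrative activations from some hidden layer
print(dropout(a))                    # some entries zeroed, the rest scaled to 1.25
print(dropout(a, training=False))    # unchanged at run time
```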
Early Stopping
• Another, more heuristic approach to regularization is early stopping.
• This refers to choosing some rule for when to stop training.
• Example:
  • Check the validation log-loss every 10 epochs.
  • If it is higher than it was last time, stop and use the previous model (i.e., from 10 epochs earlier).
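A minimal NumPy sketch of this rule on a toy logistic-regression model: train by gradient descent, check the validation log-loss every 10 epochs, and stop (keeping the previous parameters) as soon as it increases. The data, model, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data, split into training and validation sets (illustrative).
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=200) > 0).astype(float)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

w = np.zeros(3)
best_w, best_val = w.copy(), np.inf
lr = 0.1

for epoch in range(1, 501):
    grad = X_tr.T @ (sigmoid(X_tr @ w) - y_tr) / len(X_tr)  # gradient of the training log-loss
    w -= lr * grad
    if epoch % 10 == 0:                                      # check validation loss every 10 epochs
        val = log_loss(w, X_val, y_val)
        if val > best_val:                                   # got worse: stop, keep previous model
            w = best_w
            break
        best_w, best_val = w.copy(), val

print(epoch, round(best_val, 4))
```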
Concept of a “pseudo-ensemble”
• A collection of child models spawned from a parent model by perturbing it according to some noise process.
• In a deep neural network, dropout trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network.
[Figures: Model 1, Model 2, Model 3, etc. — each randomly masked child subnetwork is one model of the pseudo-ensemble.]
List of all the hyperparameters you can tweak in a basic Multilayer Perceptron (MLP)
1. The number of hidden layers.
2. The number of neurons in each hidden layer.
3. The activation function used in each hidden layer and in the output layer. Generally, the ReLU activation function (or one of its variants) is a good default for the hidden layers.
4. For the output layer, in general you will want the logistic activation function for binary classification, the softmax activation function for multiclass classification, or no activation function for regression.
5. If the MLP overfits the training data, you can try reducing the number of neurons or layers, or adding regularization such as dropout or early stopping.
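A hedged sketch of how these hyperparameter choices map onto code, using Keras (the slides do not prescribe a library; the layer sizes, activations, optimizer, and training settings below are illustrative defaults, not recommended values):

```python
import numpy as np
import tensorflow as tf

# Illustrative data: 3 features, 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)).astype("float32")
y = rng.integers(0, 4, size=500)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(3,)),  # hidden layer 1
    tf.keras.layers.Dense(16, activation="relu"),                    # hidden layer 2
    tf.keras.layers.Dropout(0.2),                                    # regularization if overfitting
    tf.keras.layers.Dense(4, activation="softmax"),                  # output layer: softmax for multiclass
])

model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Early stopping on the validation loss, as described above.
stop = tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
model.fit(X, y, epochs=50, validation_split=0.2, callbacks=[stop], verbose=0)
```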