
Neural Network

1
Objectives
• Understand neural networks
• What are activation functions?
• How does backpropagation work in a neural network?
• Understand feedforward neural networks

2
Motivation for Neural Network

• Use biology as inspiration for a mathematical model
• A neuron receives signals from previous neurons
• It generates a signal (or not) according to its inputs
• It passes signals on to the next neurons
• By layering many neurons, we can create complex models
3
Overview of Neural Network
• A neural network is a function
• Comprised of:
  • Neurons, which pass input values through functions and output the result
  • Weights, which carry values between neurons
• 3 main types of layers:
  • Input layer
  • Hidden layer(s)
  • Output layer

[Figure: input layer (feature vector), hidden layer 1, hidden layer 2, output layer (label). A 3-layer neural net with 3 input units, 4 hidden units in the 1st and 2nd hidden layers, and 1 output unit.]
Naming conventions: an N-layer neural network has
• N-1 layers of hidden units
• 1 output layer
4
Basic Neuron Visualization

[Figure: data from the previous layer (x1, x2, x3) enters the neuron through weights (w1, w2, w3). The neuron performs some form of computation, transforms the input with an activation function, and outputs the transformed data.]
Mathematical Model of the Neuron in a Neural Network

[Figure: inputs x1, x2, x3 are multiplied by weights w1, w2, w3 and summed with a bias b in the cell body; an activation function f is applied to produce the output.]

Net input:
    z = Σ_i w_i x_i + b = w1 x1 + w2 x2 + w3 x3 + b

Output:
    f(z) = f(Σ_i w_i x_i + b)

Another form of single-node visualization: the McCulloch-Pitts model.
7
In Vector Notation

• Bias, b
• Activation function, f
• Net input, z = w · x + b
• Output to the next layer, a = f(z) = f(w · x + b)


8
Relation to Logistic Regression
When we choose the activation function f to be the “sigmoid” function:

    f(z) = 1 / (1 + e^(−z))

then a neuron is simply a “unit” of logistic regression.

9
Example Neuron Computation

Sigmoid activation function:
    f(Σ_i w_i x_i + b) = f(z) = 1 / (1 + e^(−z))

Inputs, weights, and bias:
    x1 = 0.9,  w1 = 2
    x2 = 0.2,  w2 = 3
    x3 = 0.3,  w3 = −1
    b  = 0.5

Net input:
    z = Σ_i w_i x_i + b = 2(0.9) + 3(0.2) + (−1)(0.3) + 0.5 = 2.6

Neuron output:
    f(z) = 1 / (1 + e^(−2.6)) ≈ 0.93
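To make the arithmetic concrete, here is a minimal Python/NumPy sketch of this single-neuron computation (the function and variable names are mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Values from the example above
x = np.array([0.9, 0.2, 0.3])   # inputs from the previous layer
w = np.array([2.0, 3.0, -1.0])  # weights
b = 0.5                         # bias

z = np.dot(w, x) + b            # net input: 2(0.9) + 3(0.2) + (-1)(0.3) + 0.5 = 2.6
a = sigmoid(z)                  # neuron output: 1 / (1 + e^(-2.6)) ≈ 0.93

print(z, a)                     # 2.6  0.93...
```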
Why Neural Network?
• Why not just use a single neuron? Why do we need a larger network?
• A single neuron (like logistic regression) only permits a linear decision
boundary.
• Most real-world problems are considerably more complicated!

13
Feedforward Neural Network

[Figure (slides 14-17): a feedforward network with an input layer (x1, x2, x3), two hidden layers of sigmoid (σ) units, and an output layer (ŷ1, ŷ2, ŷ3). Successive slides highlight the input layer, the hidden layers, and the output layer.]
Feedforward Neural Network

[Figure (slides 18-20): the same network annotated with notation.]
• Weights are represented by matrices W^(1), W^(2), W^(3).
• Net inputs z^(2), z^(3), z^(4) are the sums of weighted inputs, before the activation function.
• Activations a^(1), a^(2), a^(3), a^(4) are the outputs of the neurons, passed on to the next layer.
Matrix representation of computation

    z^(2) = x W^(1) + b^(1),    a^(2) = f(z^(2))

where W^(1) is a 3x4 matrix, z^(2) is a 4-vector, and a^(2) is a 4-vector.

[Figure: the inputs x1, x2, x3 feeding the first hidden layer of σ units, labeled W^(1), z^(2), a^(2).]
21
Matrix representation of computation

For a single training instance (data point):
• Input: vector x (a row vector of length 3)
• Output: vector ŷ (a row vector of length 3)

    z^(2) = x W^(1) + b^(1),       a^(2) = f(z^(2))
    z^(3) = a^(2) W^(2) + b^(2),   a^(3) = f(z^(3))
    z^(4) = a^(3) W^(3) + b^(3),   ŷ = f(z^(4))
22
Matrix representation of computation

• In practice, we do these computations for many data points at the same time, by “stacking” the rows into a matrix.
• But the equations look the same!
• Input: matrix X (an n x 3 matrix, each row a single instance)
• Output: matrix Ŷ (an n x 3 matrix, each row a single prediction)

The next step is to adjust the weights to learn from data.
23
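Here is a minimal NumPy sketch of this stacked (batched) forward pass for the 3-4-4-3 network in the figure; the weight values are random placeholders, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Random placeholder parameters for a 3 -> 4 -> 4 -> 3 network
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(4, 3)), np.zeros(3)

def forward(X):
    """Forward pass; X has shape (n, 3), one training instance per row."""
    z2 = X @ W1 + b1          # net input of hidden layer 1, shape (n, 4)
    a2 = sigmoid(z2)
    z3 = a2 @ W2 + b2         # net input of hidden layer 2, shape (n, 4)
    a3 = sigmoid(z3)
    z4 = a3 @ W3 + b3         # net input of the output layer, shape (n, 3)
    y_hat = sigmoid(z4)       # predictions, shape (n, 3)
    return y_hat

X = rng.normal(size=(5, 3))   # 5 stacked training instances
print(forward(X).shape)       # (5, 3)
```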
How to Train a Neural Net?
• Put in training inputs, get the output
• Compare the output to the correct answers: look at the loss function J
• Adjust and repeat!
• Backpropagation tells us how to make a single adjustment using calculus.

[Figure: input (feature vector) → network → output (label).]
24
How to Train a Neural Net?
• Using Gradient Descent!
1. Make prediction
2. Calculate Loss
3. Calculate gradient of the loss function w.r.t. parameters
4. Update parameters by taking a step in the opposite direction
5. Iterate

25
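As a runnable illustration of these five steps, here is a sketch that trains a single sigmoid neuron (the logistic-regression unit from earlier) by gradient descent; the toy data, labels, and learning rate are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy data: 4 instances with 3 features each, binary labels
X = np.array([[0.9, 0.2, 0.3],
              [0.1, 0.8, 0.5],
              [0.7, 0.6, 0.1],
              [0.2, 0.1, 0.9]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w, b, lr = np.zeros(3), 0.0, 0.5

for step in range(1000):
    y_hat = sigmoid(X @ w + b)                                       # 1. make prediction
    J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))    # 2. calculate loss (log-loss)
    grad_z = (y_hat - y) / len(y)                                    # 3. gradient of the loss
    grad_w = X.T @ grad_z                                            #    w.r.t. the parameters
    grad_b = grad_z.sum()
    w -= lr * grad_w                                                 # 4. step in the opposite direction
    b -= lr * grad_b                                                 # 5. iterate

print(J, y_hat.round(2))
```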
Feedforward Neural Network

[Figure: 1. Pass in the input (x1, x2, x3). 2. Calculate each layer. 3. Get the output (ŷ1, ŷ2, ŷ3). 4. Evaluate it against the labels (y1, y2, y3).]
26
How to Train a Neural Net?
• How could we change the weights to make our Loss Function
lower?
• Think of the neural net as a function F: X → Y
• F is a complex computation involving many weights
• Given the structure, the weights “define” the function F (and
therefore define our model)
• Loss Function is J(y,F(x))
27
How to Train a Neural Net?
• Get ∂J/∂w for every weight w in the network.
• This tells us what direction to adjust each weight if we want to lower our loss function.
• Make an adjustment and repeat!

28
How to Train a Neural Net?
[Figure: the network with weight matrices W^(1), W^(2), W^(3). We want the gradient of the loss with respect to each weight matrix.]
29
How to Train a Neural Net?
• Get ∂J/∂w for every weight in the network
• Use calculus: the chain rule, etc.
• The functions are chosen to have “nice” derivatives
• Numerical issues need to be considered

30
How to Train a Neural Net?

• Recall that:
• Though they appear complex, the expressions above are easy to compute!
31
Backpropagation
We want the gradient of the loss with respect to each weight matrix:

    ∂J(y_i, ŷ_i)/∂W^(1),   ∂J(y_i, ŷ_i)/∂W^(2),   ∂J(y_i, ŷ_i)/∂W^(3)

[Figure: the network with each gradient shown above the corresponding weight matrix.]

Update the parameters by taking a step in the opposite direction.
32
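To illustrate, here is a NumPy sketch of backpropagation for a tiny one-hidden-layer network; the 3-4-1 architecture, sigmoid activations, squared-error loss, and random data are illustrative choices and are not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Tiny 3 -> 4 -> 1 network with illustrative random data
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=(8, 1)).astype(float)

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

for step in range(2000):
    # Forward pass
    z2 = X @ W1 + b1
    a2 = sigmoid(z2)
    z3 = a2 @ W2 + b2
    y_hat = sigmoid(z3)
    J = 0.5 * np.mean((y_hat - y) ** 2)                  # squared-error loss

    # Backward pass (chain rule, layer by layer)
    d_z3 = (y_hat - y) * y_hat * (1 - y_hat) / len(y)    # dJ/dz3
    d_W2 = a2.T @ d_z3                                   # dJ/dW2
    d_b2 = d_z3.sum(axis=0)
    d_a2 = d_z3 @ W2.T                                   # propagate back to layer 2
    d_z2 = d_a2 * a2 * (1 - a2)                          # dJ/dz2
    d_W1 = X.T @ d_z2                                    # dJ/dW1
    d_b1 = d_z2.sum(axis=0)

    # Gradient-descent update: step in the opposite direction
    W2 -= lr * d_W2; b2 -= lr * d_b2
    W1 -= lr * d_W1; b1 -= lr * d_b1

print("final loss:", J)
```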
Activation Functions – Sigmoid Function

The “sigmoid” function:  f(z) = 1 / (1 + e^(−z))

• The curve looks like an S-shape
• Sigmoid outputs a value between 0 and 1
• Used for models where we have to predict a probability as the output (0 to 1)
33
Activation Functions – Softmax Function

    softmax(z)_j = e^(z_j) / Σ_{k=1}^{K} e^(z_k),   for j = 1, …, K

• For multiclass classification
• Takes a vector of real numbers as input, and normalizes it into a probability distribution proportional to the exponentials of the input numbers
• After applying softmax, each element will be in the range 0 to 1, and the elements will add up to 1
• The output can be interpreted as a probability distribution
34
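A minimal NumPy implementation of this formula (the max-subtraction is a standard numerical-stability trick, not something discussed on the slide):

```python
import numpy as np

def softmax(z):
    """Normalize a vector of real numbers into a probability distribution."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())        # subtract the max for numerical stability; result unchanged
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
print(p, p.sum())                  # approx. [0.659 0.242 0.099], sums to 1.0
```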
Activation Functions – Hyperbolic Tangent (tanh) Function

• Generally better than the sigmoid function
• Negative inputs are mapped strongly negative, and zero inputs are mapped near zero in the tanh graph
• Mainly used for classification between two classes
35
Activation Functions – Rectified Linear Unit (ReLU)

• The most widely used activation function right now
• Used in almost all convolutional neural networks and deep learning models
• Half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero
• Range: [0, infinity)
36
Activation Functions – “Leaky” Rectified Linear Unit (LReLU or LReL)

• The leak helps to increase the range of the ReLU function; the slope α for negative inputs is typically 0.01 or so
• The range of the Leaky ReLU is (−infinity, infinity)
37
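Putting the last few slides together, here is a short NumPy sketch of these activation functions (using α = 0.01 for the leaky variant, as on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # S-shaped, output in (0, 1)

def tanh(z):
    return np.tanh(z)                      # output in (-1, 1), zero-centered

def relu(z):
    return np.maximum(0.0, z)              # 0 for z < 0, z otherwise; range [0, inf)

def leaky_relu(z, alpha=0.01):
    return np.where(z < 0, alpha * z, z)   # small slope alpha for negative inputs

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(z).round(3))
```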
Choice of Activation Functions
• Sigmoid functions and their combinations generally work better in the case of classifiers
• Sigmoid and tanh functions are sometimes avoided due to the vanishing gradient problem
• ReLU is a general-purpose activation function and is used in most cases these days
• If we encounter dead neurons in our network, the leaky ReLU function is the best choice
• Keep in mind that the ReLU function should only be used in the hidden layers
• As a rule of thumb, begin with the ReLU function and move to other activation functions if ReLU doesn’t provide optimal results
38
Activation Functions
• Each neural network had three hidden layers with three units in each one.
• The only difference was the activation function.
• Learning rate: 0.03; regularization: L2.
39
Dropout overview
• Neural networks can represent extremely complex data
• A very large number of parameters allows NNs to memorize a dataset
• We want to regularize (smooth) their solution:
  • Prevent single neurons from dominating
  • Require other neurons to be more flexible
• Dropout: randomly zero the output of neurons during training, so the other neurons have to adapt

40
Dropout model

41
Dropout layer

[Figure: the dropout layer acts as a set of gates on the neuron outputs.]
42
Knocking out and rescaling neurons

During training, we randomly drop each neuron with probability p

43
Knocking out and rescaling neurons

When running the model, we scale the outputs of the neuron by (1 − p)

44
Knocking out and rescaling neurons

This ensures that the expected value of each neuron's output stays the same at run time as it was during training

45
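Here is a short NumPy sketch of this scheme as described on the slides (drop with probability p during training, scale by 1 − p at run time); the activations and p value are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p, training):
    """Drop each neuron's output with probability p during training;
    scale outputs by (1 - p) at run time."""
    if training:
        mask = rng.random(a.shape) >= p     # keep each unit with probability 1 - p
        return a * mask                     # knocked-out units output 0
    return a * (1.0 - p)                    # rescale so the expected value matches

a = np.array([0.2, 0.9, 0.5, 0.7])          # activations of a hidden layer (made up)
print(dropout(a, p=0.5, training=True))     # some entries zeroed at random
print(dropout(a, p=0.5, training=False))    # all entries scaled by 0.5
```

Many libraries instead use the "inverted dropout" variant, which divides the kept activations by (1 − p) during training so that nothing needs to be rescaled at run time.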
Early Stopping
• Another, more heuristic approach to regularization is early stopping.
• This refers to choosing some rule for when to stop training.
• Example:
  • Check the validation log-loss every 10 epochs.
  • If it is higher than it was last time, stop and use the previous model (i.e., from 10 epochs earlier).

46
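A small sketch of that example rule in Python; the list of validation losses is made up, and the checkpoint-every-10-epochs bookkeeping is an assumption for illustration:

```python
def early_stopping_index(val_losses):
    """Stop as soon as the validation log-loss rises, and keep the previous model."""
    for i in range(1, len(val_losses)):
        if val_losses[i] > val_losses[i - 1]:
            return i - 1        # index of the checkpoint to keep (10 epochs earlier)
    return len(val_losses) - 1  # never rose: keep the last checkpoint

# Illustrative made-up validation log-losses, recorded every 10 epochs
losses = [0.69, 0.55, 0.48, 0.44, 0.46, 0.50]
best = early_stopping_index(losses)
print(f"stop after checkpoint {best} (epoch {10 * (best + 1)}), loss {losses[best]}")
```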
Concept of a “pseudo-ensemble”
• A collection of child models spawned from a parent model by perturbing it according to some noise process
• In a deep neural network, dropout trains a pseudo-ensemble of child subnetworks generated by randomly masking nodes in the parent network
47
Model 1, Model 2, Model 3, etc.

[Figures (slides 48-50): examples of child subnetworks produced by different random dropout masks of the parent network.]
List of all the hyperparameters you can tweak in a basic Multilayer Perceptron (MLP)

1. The number of hidden layers
2. The number of neurons in each hidden layer
3. The activation function used in each hidden layer and in the output layer. Generally, the ReLU activation function (or one of its variants) is a good default for the hidden layers.
4. For the output layer, in general you will want the logistic activation function for binary classification, the softmax activation function for multiclass classification, or no activation function for regression.
5. If the MLP overfits the training data, you can try …
References:
• https://levelup.gitconnected.com/vanishing-and-exploding-gradients-ae7fb88f3b66
• https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
• https://medium.com/arteos-ai/the-differences-between-sigmoid-and-softmax-activation-function-12adee8cf322
52
