M3 L4 RNN Regularization
• Activation functions
• Regularization techniques
• Transfer learning
Problem
Consider the unit shown in the figure. Suppose that the weights corresponding to
its three inputs have the following values:
w1 = 2
w2 = −4
w3 = 1
and the activation of the unit is given by a step function.
Calculate the output value y of the unit for each of the following
input patterns:
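The exact threshold of the step function and the specific input patterns are not reproduced in the text above, so the sketch below assumes a unit step at 0 and uses made-up patterns purely to illustrate how the output is computed:

# Sketch of the unit: weighted sum of three inputs passed through a step
# function. The threshold (0 here) and the example input patterns below
# are assumptions for illustration; the originals are not shown in the text.

def step(s, threshold=0.0):
    """Step activation: 1 if the weighted sum reaches the threshold, else 0."""
    return 1 if s >= threshold else 0

def unit_output(x, w=(2, -4, 1)):
    """Output y of the unit for one input pattern x = (x1, x2, x3)."""
    return step(sum(wi * xi for wi, xi in zip(w, x)))

for pattern in [(1, 0, 0), (0, 1, 0), (1, 1, 1)]:  # hypothetical patterns
    print(pattern, "->", unit_output(pattern))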
Problem
Logical operators (i.e. NOT, AND, OR, XOR, etc.) are the building blocks of any computational device. Logical functions
return only two possible values, true or false, based on the truth values of their arguments.
For example, the operator AND returns true only when all of its arguments are true; otherwise (if any argument is
false) it returns false. If we denote true by 1 and false by 0, then the logical function AND can be represented by the
following table:

x1  x2 | x1 AND x2
 0   0 |     0
 0   1 |     0
 1   0 |     0
 1   1 |     1

This function can be implemented by a single unit with two inputs,
if the weights are w1 = 1 and w2 = 1 and the activation is a step function whose threshold lies above 1 and at most 2,
so that the unit fires only when both inputs are 1.
c) The XOR function (exclusive or) returns true only when exactly one of its arguments is true.
Otherwise, it returns false. This can be represented by the following
table:

x1  x2 | x1 XOR x2
 0   0 |     0
 0   1 |     1
 1   0 |     1
 1   1 |     0
Do you think it is possible to implement this function using a single unit? A network of several
units?
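As a rough illustration of the answer, the sketch below implements AND with a single threshold unit and XOR with a small two-layer network. The thresholds used (1.5 and 0.5) are illustrative choices rather than values from the slides, and the hidden-layer construction is just one of several possibilities.

# Sketch: AND as a single threshold unit, XOR as a two-layer network.
# The weights w1 = w2 = 1 follow the problem statement; the thresholds
# below are assumptions chosen so that the units compute AND and XOR.

def step(x):
    return 1 if x > 0 else 0

def and_unit(x1, x2):
    # Single unit: fires only when both inputs are 1 (weighted sum 2 > 1.5).
    return step(1 * x1 + 1 * x2 - 1.5)

def xor_network(x1, x2):
    # XOR is not linearly separable, so a single unit is not enough.
    # h1 detects "at least one input is 1", h2 detects "both inputs are 1";
    # the output fires when h1 is on and h2 is off.
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    return step(h1 - h2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "AND:", and_unit(x1, x2), "XOR:", xor_network(x1, x2))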
Cost function of neural networks
• Used to measure how far the predicted values are from the expected values.
• d denotes the true (target) value.
• Y denotes the neuron's prediction.
• M is the number of output nodes.
• The greater the error of the neural network, the higher the value of the cost function.
• Quadratic cost function (a standard form is sketched below):
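The slide's own formula is not reproduced in the extracted text; a standard quadratic (sum-of-squared-errors) cost over the M output nodes, writing d_i for the target of output node i and y_i for its prediction, is:

% Quadratic cost for one training example; the normalizing constant
% (1/2 here) varies between texts and is an assumption.
E = \frac{1}{2} \sum_{i=1}^{M} (d_i - y_i)^2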
Cost function of neural networks
• Cross-entropy cost function (a standard form is sketched below):
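Again the slide's formula is missing from the extracted text; a standard cross-entropy cost for M sigmoid outputs with targets d_i and predictions y_i is:

% Cross-entropy cost; this particular form (binary targets, natural log)
% is an assumption, since the original equation is not in the text.
E = -\sum_{i=1}^{M} \left[ d_i \ln y_i + (1 - d_i) \ln(1 - y_i) \right]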
Recurrent neural networks
• Given the price of a stock over the last week, predict whether the stock price will go up.
Recurrent neural networks
• The neural architectures discussed so far are inherently designed for multidimensional data in which the attributes are largely
independent of one another.
• However, certain data types such as time-series, text, and biological data contain sequential dependencies
among the attributes.
1. In a time-series data set, the values at successive time stamps are closely related to one another. If one
uses the values at these time stamps as independent features, then key information about the
relationships among successive values is lost.
2. Although text is often processed as a bag of words, one can obtain better semantic insights when the
ordering of the words is used. In such cases, it is important to construct models that take the sequencing
information into account. Text data is the most common use case of recurrent neural networks.
3. Biological data often contains sequences, in which the symbols might correspond to amino acids or one of
the nucleobases that form the building blocks of DNA.
Recurrent neural networks
• A recurrent neural network or RNN is a neural network which maps from an input space of
sequences to an output space of sequences in a stateful way.
• That is, the prediction of output yt depends not only on the input xt, but also on the hidden state
of the system, ht, which gets updated over time, as the sequence is processed.
• Such models can be used for sequence generation, sequence classification, and sequence
translation.
• RNNs are an architecture in which a set of hidden units is replicated at each time step, with
connections between the copies at successive steps.
• Equivalently, an RNN has a single set of input units, hidden units, and output units, with the hidden units feeding into
themselves, so the graph of an RNN may have self-loops.
• The self-loops mean that the values of the hidden units at one time step depend on their
values at the previous time step (a minimal sketch of this update is given below).
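A minimal sketch of this recurrence, assuming the conventional weight names W_xh, W_hh, W_hy and a tanh hidden nonlinearity (none of which are specified on the slide):

import numpy as np

# Minimal RNN forward pass: the hidden state h is updated from the current
# input and from its own previous value (the self-loop), and each output is
# read from the hidden state. Weight names and tanh are assumptions.

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    h = h0
    ys = []
    for x in xs:                            # one step per sequence element
        h = np.tanh(W_xh @ x + W_hh @ h)    # h_t depends on x_t and h_{t-1}
        ys.append(W_hy @ h)                 # y_t read out from h_t
    return ys, h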
Delayed Sequence to sequence
Sequence to sequence
Sequence to vector
Vector to sequence
Example of an RNN and its unrolled representation
Each color corresponds to a weight matrix that is replicated at all time steps.
Example of an RNN which sums its inputs over time
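The figure is not reproduced in the text; the sketch below uses the natural weight choice for this behaviour (linear hidden unit, with input-to-hidden, hidden-to-hidden and hidden-to-output weights all equal to 1), which makes the hidden state accumulate the running sum of the inputs.

# Summing RNN sketch: with a linear hidden unit and all weights equal to 1,
# the hidden state is the running sum of the inputs and the output simply
# reads it out. These weight values are assumed, as the figure is missing.

def summing_rnn(inputs):
    h = 0.0
    outputs = []
    for x in inputs:
        h = 1.0 * h + 1.0 * x    # h_t = h_{t-1} + x_t (linear hidden unit)
        outputs.append(1.0 * h)  # output equals the accumulated sum
    return outputs

print(summing_rnn([2, -1, 3]))   # [2.0, 1.0, 4.0] -- running sums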
Example 2 RNN
• RNN which receives two inputs at
each time step, and which
determines which of the two
inputs has a larger sum over time
steps.
• The hidden unit is linear, and the
output unit is logistic.
• The output unit is a logistic unit
with a weight of 5. Recall that a
large weight makes the logistic
very steep, effectively turning it
into a hard threshold at 0.
• The hidden-to-hidden weight is 1,
so by default it remembers its
previous value.
• The input-to-hidden weights are 1
and -1, which means it adds one of
the inputs and subtracts the other
(a sketch of this network follows below).
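A sketch of this network using exactly the weights described above (hidden-to-hidden weight 1, input weights 1 and -1, logistic output with weight 5); remaining details, such as the absence of biases, are assumptions.

import math

# Comparison RNN sketch: the linear hidden unit keeps a running difference
# of the two input streams, and the logistic output with weight 5 acts
# almost like a hard threshold at 0, reporting which sum is larger so far.

def compare_rnn(xs1, xs2):
    h = 0.0
    outputs = []
    for x1, x2 in zip(xs1, xs2):
        h = 1.0 * h + 1.0 * x1 - 1.0 * x2      # running sum of (x1 - x2)
        y = 1.0 / (1.0 + math.exp(-5.0 * h))   # ~1 if stream 1 ahead, ~0 otherwise
        outputs.append(y)
    return outputs

print(compare_rnn([2, 0, 1], [1, 3, 0]))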
Backprop Through Time: unrolled computation graph
Vectorized backprop rules
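The slide's equations are not in the extracted text. A standard vectorized formulation of the backprop-through-time rules, assuming the forward pass z^(t) = W_xh x^(t) + W_hh h^(t-1) + b_h, h^(t) = φ(z^(t)), r^(t) = W_hy h^(t) + b_y, and writing \bar{v} for ∂E/∂v, is:

% Standard BPTT rules under the forward-pass assumptions stated above;
% the notation is ours, not the slide's.
\bar{h}^{(t)} = W_{hy}^{\top}\,\bar{r}^{(t)} + W_{hh}^{\top}\,\bar{z}^{(t+1)}
\qquad
\bar{z}^{(t)} = \bar{h}^{(t)} \odot \phi'\!\left(z^{(t)}\right)
\bar{W}_{xh} = \sum_t \bar{z}^{(t)} x^{(t)\top}
\qquad
\bar{W}_{hh} = \sum_t \bar{z}^{(t+1)} h^{(t)\top}
\qquad
\bar{W}_{hy} = \sum_t \bar{r}^{(t)} h^{(t)\top}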
RNN
Regularization techniques
• Regularization adds a penalty term to the error function E.
• A hyperparameter controls how much regularization is applied; it is called the regularization
parameter or regularization rate, is denoted by λ, and is divided by the size
of the batch used.
• Modified L2-regularized error function (a standard form is sketched below):
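The formula itself is missing from the extracted text; a common form, using the regularization rate λ divided by the batch size m as described above (the extra factor of 2 in the denominator is a frequent convention and an assumption here), is:

% L2-regularized error: original error plus a penalty on the squared weights.
E_{\mathrm{reg}} = E + \frac{\lambda}{2m} \sum_{j} w_j^{2}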
L2 Regularization
• The intuition is that during learning, smaller weights will be preferred, but larger
weights will be accepted if they yield a significant overall decrease in the error.
• This is why the technique is also called 'weight decay': each gradient step shrinks the weights
by a small multiplicative factor (see the update rule sketched below).
• The choice of λ determines how strongly small weights are preferred (when λ is large, the
preference for small weights is strong).
• Regularized error function and the resulting weight update:
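A sketch of the gradient-descent update produced by the regularized error, assuming the L2 penalty λ/(2m) Σ_j w_j² from the previous slide, learning rate η and batch size m; the multiplicative shrinkage factor is what gives 'weight decay' its name:

% Each update multiplies w_j by (1 - ηλ/m), a factor slightly below 1,
% before applying the usual gradient step on the unregularized error E.
w_j \leftarrow w_j - \eta \frac{\partial E_{\mathrm{reg}}}{\partial w_j}
    = \left(1 - \frac{\eta\lambda}{m}\right) w_j - \eta \frac{\partial E}{\partial w_j}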
L1 Regularization
• L1 regularization is superior when much of the data is irrelevant: very noisy data, features that
are not informative, or sparse data (where most features are irrelevant because
they are missing).
• Unlike L2, the L1 penalty tends to drive many weights exactly to zero, which produces
sparse models (a standard form of the penalty is sketched below).
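A common form of the L1 penalty, following the same λ/m convention as the L2 case above (the normalization is an assumption, since the slide's formula is not in the text):

% L1-regularized error: the absolute-value penalty tends to drive many
% weights exactly to zero, which suits data with many irrelevant features.
E_{\mathrm{reg}} = E + \frac{\lambda}{m} \sum_{j} \lvert w_j \rvert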
Architecture of an LSTM cell
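The figure itself is not reproduced in the text; the standard LSTM cell equations, with input gate i_t, forget gate f_t, output gate o_t, candidate cell state \tilde{c}_t, cell state c_t, hidden state h_t, logistic function σ and element-wise product ⊙, are:

% Standard LSTM formulation (weight and bias names are the usual convention).
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)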