M3 L4 RNN Regularization

The document discusses various topics related to deep learning including artificial neural networks, feedforward and backpropagation, activation functions, optimizers, regularization techniques, recurrent neural networks, and transfer learning. It also includes examples of calculating outputs of a single neural unit for different input patterns and implementing logical operators using neural networks.


Module 3

Deep Learning 10 Hours

• Artificial Neural Networks (ANN): architecture

• Feed-forward and back propagation

• Activation functions

• Optimizers in deep learning

• Regularization techniques

• Recurrent neural networks

• Transfer learning

Problem
Consider the unit shown in the figure. Suppose that the weights corresponding to the three inputs have the following values:
w1 = 2
w2 = −4
w3 = 1
and the activation of the unit is given by the step-function:

Calculate the output value y of the unit for each of the following input patterns:
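The input patterns themselves are given in a figure that is not reproduced here. As a quick illustrative sketch (the patterns and the step-function threshold below are placeholders, not taken from the slide), the output of the unit can be computed as follows:

```python
import numpy as np

# Weights from the problem statement
w = np.array([2, -4, 1])

def step_unit(x, threshold=0):
    """Single unit with a step activation: outputs 1 when the weighted sum of
    its inputs reaches the threshold. The threshold value is an assumption;
    the slide defines the step function only in a figure."""
    v = np.dot(w, x)  # weighted sum v = w1*x1 + w2*x2 + w3*x3
    return 1 if v >= threshold else 0

# Illustrative input patterns (placeholders, not the slide's table)
for pattern in [(1, 0, 0), (0, 1, 0), (1, 1, 1)]:
    print(pattern, "->", step_unit(np.array(pattern)))
```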

Problem
Logical operators (i.e. NOT, AND, OR, XOR, etc.) are the building blocks of any computational device. Logical functions return only two possible values, true or false, based on the truth values of their arguments.
For example, the operator AND returns true only when all its arguments are true; otherwise (if any of the arguments is false) it returns false. If we denote true by 1 and false by 0, then the logical function AND can be represented by the following table:
This function can be implemented by a single unit with two inputs, if the weights are w1 = 1 and w2 = 1 and the activation function is:

Note that the threshold level is 2 (v ≥ 2).


a) Test how the neural AND function works.
b) Suggest how to change either the weights or the threshold level of this single unit in order to implement the logical OR function (true when at least one of the
arguments is true):

c) The XOR function (exclusive or) returns true only when exactly one of the arguments is true and the other is false. Otherwise, it always returns false. This can be represented by the following
table:
Do you think it is possible to implement this function using a single unit? A network of several
units?
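As an illustrative check (not part of the original problem set), the AND unit and one possible OR variant can be simulated directly. Note that XOR is not linearly separable, so no single threshold unit can implement it, but a small network of such units (e.g. OR and NAND feeding into AND) can:

```python
def threshold_unit(x1, x2, w1=1, w2=1, threshold=2):
    """Two-input unit that fires when w1*x1 + w2*x2 >= threshold."""
    return 1 if w1 * x1 + w2 * x2 >= threshold else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        and_out = threshold_unit(x1, x2, threshold=2)  # AND: fires only for (1, 1)
        or_out = threshold_unit(x1, x2, threshold=1)   # OR: one option is to lower the threshold to 1
        print((x1, x2), "AND:", and_out, "OR:", or_out)
```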

Cost function of neural networks

• Used to measure how far the predicted values are from the expected values.
• The variable d represents the true value.
• The variable y represents the neuron's prediction.
• M is the number of output nodes.
• The greater the error of the neural network, the higher the value of the cost function.
• Quadratic cost function
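The formula itself appears only as an image on the slide; with the notation above (d_i for the true values, y_i for the predictions, M output nodes), the quadratic cost it refers to is the standard sum-of-squares form:

E = \frac{1}{2} \sum_{i=1}^{M} (d_i - y_i)^2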

Cost function of neural networks
• Cross entropy function
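The formula is shown as an image on the slide; for outputs y_i in (0, 1) and targets d_i, the standard cross entropy cost it refers to is:

E = \sum_{i=1}^{M} \left[ -d_i \ln y_i - (1 - d_i) \ln (1 - y_i) \right]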

• The cross entropy function is much more sensitive to the error than the quadratic cost.


• For this reason, the learning rules derived from the cross entropy function are
generally known to yield better performance.
• It is recommended to use the cross-entropy-driven learning rules except in unavoidable cases such as regression.

Recurrent neural networks

• Recurrent neural networks are used for making predictions about data in the form of sequences.

• Given the price of a stock over the last week, predict whether the stock price will go up.

• Given a sentence (sequence of chars/words) predict its sentiment.

• Given a sentence in English, translating it to French is a sequence-to-sequence prediction task, because both the inputs and the outputs are sequences.

Recurrent neural networks
• The neural architectures discussed so far are inherently designed for multidimensional data in which the attributes are largely independent of one another.

• However, certain data types such as time-series, text, and biological data contain sequential dependencies
among the attributes.

• Examples of such dependencies are as follows:

1. In a time-series data set, the values on successive time-stamps are closely related to one another. If one
uses the values of these time-stamps as independent features, then key information about the
relationships among the values of these time-stamps is lost.

2. Although text is often processed as a bag of words, one can obtain better semantic insights when the
ordering of the words is used. In such cases, it is important to construct models that take the sequencing
information into account. Text data is the most common use case of recurrent neural networks.

3. Biological data often contains sequences, in which the symbols might correspond to amino acids or to one of the nucleobases that form the building blocks of DNA.
Recurrent neural networks
• A recurrent neural network or RNN is a neural network which maps from an input space of
sequences to an output space of sequences in a stateful way.
• That is, the prediction of the output y_t depends not only on the input x_t, but also on the hidden state of the system, h_t, which gets updated over time as the sequence is processed.
• Such models can be used for sequence generation, sequence classification, and sequence
translation.
• RNNs can be viewed as an architecture with a set of hidden units replicated at each time step, and connections between them.
• Equivalently, an RNN has a single set of input units, hidden units, and output units, where the hidden units feed into themselves, so the graph of an RNN may have self-loops.
• These self-loops mean that the values of the hidden units at one time step depend on their values at the previous time step.
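As a minimal sketch of this statefulness (not taken from the slides; the weight matrices W_xh, W_hh, W_hy and the tanh activation are assumptions), a vanilla RNN step can be written as:

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    """Run a vanilla RNN over a sequence of input vectors xs.

    h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1})  # hidden state carries memory forward
    y_t = W_hy @ h_t                         # output read from the hidden state
    """
    h = h0
    ys = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        ys.append(W_hy @ h)
    return ys, h

# Example usage with random weights (3-dimensional inputs, 4 hidden units, 2 outputs)
rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(5)]
ys, h = rnn_forward(xs, rng.standard_normal((4, 3)),
                    rng.standard_normal((4, 4)),
                    rng.standard_normal((2, 4)), np.zeros(4))
```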
Delayed sequence to sequence

Sequence to sequence

Sequence to vector

Vector to sequence

Example of an RNN and its unrolled representation

Each color corresponds to a weight matrix which is replicated at all time steps.
Example of an RNN which sums its inputs over time

• All of the units are linear.
• The hidden-to-output weight is 1, which means the output unit just copies the hidden activation.
• The hidden-to-hidden weight is 1, which means that in the absence of any input, the hidden unit just remembers its previous value.
• The input-to-hidden weight is 1, which means the input gets added to the hidden activation in every time step.
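As an illustrative sketch (not from the slides), this summing behaviour can be reproduced directly with linear units and all three weights set to 1:

```python
def summing_rnn(inputs):
    """Linear RNN whose input-to-hidden, hidden-to-hidden and
    hidden-to-output weights are all 1: the output at each step
    is the running sum of the inputs seen so far."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = 1.0 * h + 1.0 * x    # linear hidden unit: remember previous value and add the input
        outputs.append(1.0 * h)  # output unit just copies the hidden activation
    return outputs

print(summing_rnn([2, -1, 3]))  # [2.0, 1.0, 4.0]
```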

Example 2: RNN
• An RNN which receives two inputs at each time step and determines which of the two inputs has the larger sum over the time steps so far.
• The hidden unit is linear, and the output unit is logistic.
• The output unit is a logistic unit with a weight of 5. Recall that large weights squash the logistic function, effectively making it a hard threshold at 0.
• The hidden-to-hidden weight is 1, so by default it remembers its previous value.
• The input-to-hidden weights are 1 and -1, which means it adds one of the inputs and subtracts the other.
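A small sketch of this comparison network (illustrative, not from the slides):

```python
import math

def compare_streams(pairs):
    """At each step receive (a_t, b_t); output close to 1 if the running sum
    of the a-inputs exceeds that of the b-inputs, close to 0 otherwise.
    The hidden unit is linear; the logistic output with weight 5 acts as an
    approximate hard threshold at 0."""
    h = 0.0
    outputs = []
    for a, b in pairs:
        h = h + 1.0 * a - 1.0 * b                   # add one input, subtract the other
        outputs.append(1 / (1 + math.exp(-5 * h)))  # steep logistic output
    return outputs

print(compare_streams([(1, 0), (0, 2), (3, 0)]))  # approx. [0.99, 0.01, 0.99]
```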

Backprop Through Time: unrolled computation graph

Forward pass and backward pass:
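These appear as images on the slide; as a sketch in assumed notation (z_t the pre-activation, h_t the hidden state, φ the hidden activation, and \bar{v} denoting the derivative of the loss with respect to v), the standard BPTT equations are

z_t = W_{xh} x_t + W_{hh} h_{t-1} + b_h, \quad h_t = \phi(z_t), \quad y_t = W_{hy} h_t + b_y   (forward)

\bar{h}_t = W_{hy}^{\top} \bar{y}_t + W_{hh}^{\top} \bar{z}_{t+1}, \quad \bar{z}_t = \bar{h}_t \odot \phi'(z_t), \quad \bar{W}_{hh} = \sum_t \bar{z}_t h_{t-1}^{\top}   (backward)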

Vectorized backprop rules

RNN

Regularization techniques
• Regularization adds a regularization term to the error function E.

Figure: various choices of hyperplanes, shown with and without the regularization term.
L2 Regularization

• Other names: ‘weight decay’, ‘ridge regression’, and ‘Tikhonov regularization’.


• L2 regularization uses the L2 (Euclidean) norm for the regularization term.
• The L2 norm of a vector x = (x_1, x_2, ..., x_n) is \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}.
• L2-regularized error function:

• Add a hyperparameter to adjust how much regularization is used (called the regularization parameter or regularization rate, and denoted by λ), and divide the regularization term by the size of the batch used.
• Modified L2-regularized error function:
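The two formulas on this slide appear as images; as a sketch of the standard weight-decay form they describe (with w_i the weights and n the batch size, both assumed symbols), they are

E_{reg} = E + \frac{1}{2} \sum_{i} w_i^{2}

and, after adding the regularization rate λ and dividing by the batch size,

E_{reg} = E + \frac{\lambda}{2n} \sum_{i} w_i^{2}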

L2 Regularization
• The intuition is that during the learning procedure, smaller weights will be preferred, but larger
weights will be considered if the overall decrease in error is significant.
• This explains why it is called ‘weight decay’.
• The choice of λ determines how strongly small weights are preferred (when λ is large, the preference for small weights is strong).
• Regularized error function:

• For most classification and prediction problems, L2 regularization is used.
• By taking the partial derivatives of this equation and substituting them back into the general weight update rule:
L1 Regularization

• Other names: ‘lasso’ or ‘basis pursuit denoising’.

• L1 regularization uses the absolute values of the weights instead of their squares:

• L1 is superior when there is a lot of irrelevant data: either very noisy data, features that are not informative, or sparse data (where most features are irrelevant because they are missing).

• Applications of L1 regularization are in signal processing and robotics.
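The formula appears as an image on the slide; in the same notation as the L2 case (a sketch, not the slide's exact form), the L1-regularized error function is

E_{reg} = E + \frac{\lambda}{n} \sum_{i} |w_i|

Because the gradient of this penalty has constant magnitude regardless of the size of the weight, L1 regularization tends to drive many weights exactly to zero, which is what makes it well suited to irrelevant or sparse features.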

Architecture of an LSTM cell
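The slide shows the cell only as a diagram; as a sketch of the standard LSTM cell it depicts (σ is the logistic function, ⊙ elementwise multiplication, [h_{t-1}, x_t] concatenation; the symbols are assumptions, not taken from the figure):

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)   (forget gate)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)   (input gate)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)   (candidate cell state)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   (cell state update)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)   (output gate)
h_t = o_t \odot \tanh(c_t)   (hidden state / cell output)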
