Deep Learning Midsem Merged Previous Batch


Deep Neural Network

AIML Module 1
Seetha Parameswaran
BITS Pilani
The author of this deck, Prof. Seetha Parameswaran, gratefully acknowledges the authors who made their course materials freely available online.
Course Logistics
What we Learn…. (Module Structure)
1. Fundamentals of Neural Network
2. Multilayer Perceptron
3. Deep Feedforward Neural Network
4. Improve the DNN performance by Optimization and Regularization
5. Convolutional Neural Networks
6. Sequence Models
7. Attention Mechanism
8. Neural Architecture Search
9. Time series Modelling and Forecasting
10. Other Learning Techniques
Text book
● Dive into Deep Learning by Aston Zhang, Zack C. Lipton, Mu Li, Alex
J. Smola. https://d2l.ai/chapter_introduction/index.html
Course Logistics
● Refer Canvas for the following
○ Handout
○ Schedule for Webinar
○ Schedule of Quiz, and Assignments.
○ Evaluation scheme
○ Session Slide Deck
○ Demo Lab Sheets
○ Quiz-I, Quiz-II
○ Assignment-I, Assignment-II
○ Sample QPs
● Lecture Recordings
○ Available on Microsoft Teams
Honour Code
All submissions for graded components must be the result of your original effort.
It is strictly prohibited to copy and paste verbatim from any sources, whether
online or from your peers. The use of unauthorized sources or materials, as well
as collusion or unauthorized collaboration to gain an unfair advantage, is also
strictly prohibited. Please note that we will not distinguish between the person
sharing their resources and the one receiving them for plagiarism, and the
consequences will apply to both parties equally.

Suspicious circumstances, such as identical verbatim answers or a significant
overlap of unreasonable similarities across a set of submissions, will be
investigated, and severe punishments will be imposed on all those found guilty
of plagiarism.
In case of Queries regarding the course….
Step 1: Post in the discussion forum.
● Read through the existing post and if you find any topic similar to your
concern, add on to the existing discussion.
● Avoid duplication of queries or issues.
Step 2: Email the IC at [email protected] if the query or
issue is not resolved within 1 week's time. Turnaround time for a response to
the email is 48 hours.
● Please mention the phrase "DNN" clearly in the subject line.
● Use your BITS email ID for correspondence.
PATIENCE is highly APPRECIATED :)
What is Deep Learning?
Definitions
● Deep Learning is a type of machine learning based on artificial neural
networks in which multiple layers of processing are used to extract
progressively higher level features from data.
● Deep learning is a method in artificial intelligence (AI) that teaches
computers to process data in a way that is inspired by the human brain.
● Deep learning is a machine learning technique that teaches computers to
do what comes naturally to humans: learn by example.
● Deep learning is a subset of machine learning, which is essentially a
neural network with three or more layers.
● Deep Learning gets its name from the fact that we add more "Layers" to
learn from the data.
Why Deep Learning?
Why Deep Learning?
● Large amounts of data
● Lots and lots of unstructured data like images, text, audio, video
● Cheap, high-quality sensors
● Cheap computation - CPU, GPU, Distributed clusters
● Cheap data storage
● Learn by examples
● Automated feature generation
● Better learning capabilities
● Scalability
● Advanced analytics can be applied
Why deep learning?
Deep Learning Timeline
Applications of Deep Learning
Breakthroughs with Neural Networks
Many more applications….
● Write a program that predicts tomorrowʼs weather given geographic
information, satellite images, and a trailing window of past weather.
● Write a program that takes in a question, expressed in free-form text,
and answers it correctly.
● Write a program that given an image can identify all the people it
contains, drawing outlines around each.
● Write a program that presents users with products that they are likely
to enjoy but unlikely, in the natural course of browsing, to encounter.
Key components of DL problem
Core components of DL problem
1. The data that we can learn from.
2. A model of how to transform the data.
3. An objective function that quantifies how well (or badly) the model is
doing.
4. An algorithm to adjust the modelʼs parameters to optimize the
objective function.
Data
● Collection of examples.
● Data has to be converted to a useful and suitable numerical
representation.
● Each example (or data point, data instance, sample) typically consists
of a set of attributes called features (or covariates), from which the
model must make its predictions.
● In the supervised learning problems, the attribute to be predicted is
designated as the label (or target).
● We need right data.
Data
● Dimensionality of data
○ Each example has the same number of numerical values. Such data consists of
fixed-length vectors. Eg: images.
○ The constant length of the vectors is called the dimensionality of the data.
○ Text data has varying length.
Model
● Model denotes the computational machinery for ingesting data of
one type, and spitting out predictions of a possibly different type.
● Deep learning models consist of many successive transformations of
the data that are chained together top to bottom, thus the name deep
learning.
Objective Function
● Learning means improving at some task over time.
● A formal mathematical system of learning machines is defined using
formal measures of how good (or bad) the models are. These formal
measures are called as objective functions.
● By convention, objective functions are defined so that lower is better.
● Because lower is better, these functions are sometimes called loss
functions.
Loss Functions
● To predict numerical values (regression), the most common loss
function is squared error.
● For classification, the most common objective is to minimize error
rate, i.e., the fraction of examples on which our predictions disagree
with the ground truth.
Loss Functions
● Loss function is defined with respect to the modelʼs parameters
and depends upon the dataset.
● We learn the best values of our modelʼs parameters by minimizing
the loss incurred on a set consisting of some number of examples
collected for training. However, doing well on the training data does
not guarantee that we will do well on unseen data. I.e Model has to
generalize better.
● When a model performs well on the training set but fails to generalize
to unseen data, we say that it is overfitting.
Optimization Algorithms
● Optimization Algorithm is an algorithm capable of searching for the
best possible parameters for minimizing the loss function.
● Popular optimization algorithms for deep learning are based on an
approach called gradient descent.
Kinds of Machine Learning Problems
Learning Problems
1. Supervised Learning
2. Unsupervised Learning
○ Situations where we feed the model a giant dataset containing only the features
○ Type and number of questions we could ask is limited only by our creativity.
○ Eg: Clustering, Generate synthetic data (uses GAN)
○ Self-supervised learning leverages unlabeled data to provide supervision in
training, such as by predicting some withheld part of the data using other parts.
○ Eg: Fill in the blanks
3. Reinforcement Learning
○ Develop an agent that interacts with an environment and takes actions over a
series of time steps.
Supervised Learning
● Task of predicting labels given input features.
● Each feature–label pair is called an example.
● Goal is to produce a model that maps any input to a label
prediction.
● The supervision comes into play because, for choosing the
parameters, we (the supervisors) provide the model with a dataset
consisting of labeled examples, where each example is matched with
its ground-truth label. The goal is to estimate the conditional
probability of a label given input features.
Supervised Learning
Supervised Learning Tasks
● Regression
○ how much or how many question
● Classification
○ Binary classification
○ Multi-class classification
○ Multi-label classification
● Tagging
○ Multi-label classification
● Recommender systems
● Sequence Learning
○ Tagging and Parsing
○ Automatic speech recognition
○ Text to speech
○ Machine translation
Reading from TB Dive into Deep Learning
● Chapter 1
● Chapter 2 for Python Prelims, Linear
Algebra, Calculus, Probability
Next Session:
What is Neural Network?
Deep Learning
AIML Module 1
Seetha Parameswaran
BITS Pilani
Artificial Neural Network
What are Neural Networks?
What are Neural Networks?
● It begins with the human brain.
● Humans learn, solve problems, recognize patterns, create, think
deeply about something, meditate, and much more.
● Humans learn through association. [Refer to Associationism for more
details.]
Observation: The Brain
● The brain is a mass of interconnected
neurons.
● Number of neurons is approximately
10^(10).
● Connections per neuron is approximately
10^(4 to 5).
● Neuron switching time is approximately
0.001 second.
● Scene recognition time is 1 second.
● 100 inference steps don't seem like
enough, implying a lot of parallel computation.
Brain: Interconnected Neurons
● Many neurons connect in to each neuron.
● Each neuron connects out to many neurons.
Biological Neuron
Connectionist Machines
● Network of processing elements, called artificial neural unit.
● The neurons are interconnected to form a network.
● All world knowledge is stored in the connections between these
elements.
● Neural networks are connectionist machines.
What are Artificial Neurons?
● Neuron is a processing element inspired by how the brain works.
● Similar to a biological neuron, each artificial neuron does some
computation. Each neuron is interconnected to other neurons.
● Similar to brain, the interconnections between neurons store the
knowledge it learns. The knowledge is stored as parameters.
Properties of Artificial Neural Nets (ANNs)
● Many neuron-like threshold switching units.
● Many weighted interconnections among units.
● Highly parallel, distributed process.
● Emphasis on tuning parameters or weights automatically.
When to consider Neural Networks?
● Input is high-dimensional discrete or real-valued (e.g. raw sensor
input).
● Possibly noisy data. Data has lots of errors.
● Output is discrete or real valued or a vector of values.
● Form of target function is unknown.
● Human readability, in other words, explainability, of result is
unimportant.
● Examples:
○ Speech phoneme recognition
○ Image classification
○ Financial prediction
Perceptron
Perceptron
● One type of ANN system is based on a unit called a perceptron.
● A perceptron takes a vector of real-valued inputs, calculates a linear
combination of these inputs, then outputs a 1 if the result is greater
than some threshold and -1 otherwise.
NOT Logic Gate
Question:
● How to represent NOT gate using a perceptron?
● What are the parameters for the NOT perceptron?
● Data is given below.
Perceptron for NOT gate
● One input say x1.
● One output say o(X).
● Perceptron equation is o(X) = sign(w0 + w1 x1).
○ This should be > 0 for output to be 1.
● For each row of truth table, the equations are
○ w0 + 0 >0
○ w0 + w1 <0
● One solution is w0 = 1 and w1 = (-1). (Intuitive solution)
● This gives a beautiful linear decision boundary.
Solution for NOT gate
AND Logic Gate
Question:
● How to represent AND gate using a perceptron?
● What are the parameters for the AND perceptron?
● Data is given below.
Perceptron for AND gate
● Two inputs say x1 and x2.
● One output say o(X).
● Perceptron equation is
o(X) = sign(w0 + w1 x1 + w2 x2).
○ This should be > 0 for output to be 1.
● For each row of truth table, the equations are
○ w0 + 0 + 0 <0
○ w0 + 0 + w2 < 0
○ w0 + w1 + 0 < 0
○ w0 + w1 + w2 > 0
● One solution is w1 = w2 = 1 and w0 = (-1).
● This gives a beautiful linear decision boundary.
Perceptron for AND gate
DATA MODEL DECISION BOUNDARY
Exercise
1. Represent OR gate using Perceptron. Compute the parameters of the
perceptron.
2. Represent NOR gate using Perceptron. Compute the parameters of
the perceptron.
3. Represent NAND gate using Perceptron. Compute the parameters of
the perceptron.
Perceptron for OR gate
● Two inputs say x1 and x2.
● One output say o(X).
● Perceptron equation is
o(X) = sign(w0 + w1 x1 + w2 x2).
○ This should be > 0 for output to be 1.
● For each row of truth table, the equations are
○ w0 + 0 + 0 <0
○ w0 + 0 + w2 > 0
○ w0 + w1 + 0 > 0
○ w0 + w1 + w2 > 0
● One solution is w1 = w2 = 20 and w0 = (-10).
● This gives a beautiful linear decision boundary.
Perceptron for OR gate
DATA MODEL DECISION BOUNDARY
Perceptron for NOR Gate
Perceptron for NAND Gate
Perceptron Learning Algorithm
Perceptron Learning Algorithm
Convergence of Perceptron Learning Algorithm
Can prove it will converge
● If training data is linearly separable.
● Learning rate is sufficiently small
○ The role of the learning rate is to moderate the degree to which weights are
changed at each step.
○ It is usually set to some small value (e.g., 0.1) and is sometimes made to decay as
the number of weight-tuning iterations increases.
Perceptron Learning Algorithm for NOT gate
● Assume w0 = w1 = 0. Let the learning rate = eta = 1.
● Equation is o(x) = sign(w0 + w1 x1)
Epoch 1
● For first example, x1 = 0 and d(x) = 1
○ sum = 0 + 0 = 0
○ o(x) = sign(sum) = -1
○ Not equal to d(x). So update parameters.
○ delta w1 = eta (d(x) - o(x)) x = 1 (1-(-1)) 0 = 0
○ delta w0 = eta (d(x) - o(x)) = 1 (1-(-1)) = 2 (for w0; x0 is assumed as 1)
○ New w1 = old w1 + delta w1 = 0 + 0 = 0
○ New w0 = old w0 + delta w0 = 0 + 2 = 2
● Use the above weights for next example.
Perceptron Learning Algorithm for NOT gate
● Second example uses w0 = 2 and w1 = 0. eta = 1.
● For second example, x1 = 1 and d(x) = -1
○ sum = 2 + 0 = 2
○ o(x) = sign(sum) = 1
○ Not equal to d(x)
○ So update parameters.
○ delta w1 = eta (d(x) - o(x)) x= 1 (-1-1) 1 = -2
○ delta w0 = eta (d(x) - o(x)) = 1 (-1-1) = -2
○ New w1 = old w1 + delta w1 = 0 + -2 = -2
○ New w0 = old w0 + delta w0 = 2 + -2 = 0
● Use the above weights for next epoch.
Perceptron Learning Algorithm for NOT gate
● From previous epoch w0 = 0 and w1 = -2. eta = 1.
Epoch 2
● For first example, x1 = 0 and d(x) = 1
○ sum = 0 + 0 = 0
○ o(x) = sign(sum) = -1
○ Not equal to d(x). So update parameters.
○ delta w1 = eta (d(x) - o(x)) x = 1 (1-(-1)) 0 = 0
○ delta w0 = eta (d(x) - o(x)) = 1 (1-(-1)) = 2 (for w0; x0 is assumed as 1)
○ New w1 = old w1 + delta w1 = -2 + 0 = -2
○ New w0 = old w0 + delta w0 = 0 + 2 = 2
Perceptron Learning Algorithm for NOT gate
● Second example uses w0 = 2 and w1 = -2. eta = 1.
● For the second example, x1 = 1 and d(x) = -1
○ sum = 2 + -2 = 0
○ o(x) = sign(sum) = -1
○ Equal to d(x). So NO need to update parameters.
● The Algorithm converges with w0 = 2 and w1 = -2.
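A minimal Python sketch of the perceptron learning rule traced above, using the same conventions (x0 = 1 for the bias input, sign(0) = -1); on the NOT-gate data it reproduces w0 = 2 and w1 = -2:

import numpy as np

def sign(s):
    return 1 if s > 0 else -1    # matches the convention above: sign(0) = -1

def train_perceptron(X, d, eta=1.0, epochs=10):
    w = np.zeros(X.shape[1] + 1)             # [w0, w1, ...], all zeros as above
    for _ in range(epochs):
        updated = False
        for x, target in zip(X, d):
            xa = np.concatenate(([1.0], x))  # prepend the bias input x0 = 1
            o = sign(np.dot(w, xa))
            if o != target:                  # update rule: w += eta * (d - o) * x
                w += eta * (target - o) * xa
                updated = True
        if not updated:                      # a full pass with no updates: converged
            break
    return w

# NOT gate truth table: x1 = 0 -> 1, x1 = 1 -> -1
print(train_perceptron(np.array([[0.0], [1.0]]), np.array([1, -1])))   # [ 2. -2.]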
Demo Python Code

https://colab.research.google.com/drive/1DUVcOoUIWhl8GQKc6AWR1wi0LaMeNkgD?usp=sharing

Students, please note:
The Python notebook is shared with anyone who has the link, and access is
restricted to BITS email IDs. So please do not access it from a non-BITS
email ID and send requests for access.
Exercise
1. Represent OR gate using Perceptron. Compute the parameters of the
perceptron using perceptron learning algorithm.
2. Represent AND gate using Perceptron. Compute the parameters of
the perceptron using perceptron learning algorithm.
Representational Power of Perceptrons
● A perceptron represents a hyperplane decision surface in the
n-dimensional space of examples.
● The perceptron outputs a 1 for examples lying on one side of the
hyperplane and outputs a -1 for examples lying on the other side.
Linearly Separable Data
Linearly Separable Data
● Two sets of data points in a two dimensional space
are said to be linearly separable when they can be
completely separable by a single straight line.
● In general, two groups of data points are
separable in an n-dimensional space if they can be
separated by an (n-1)-dimensional hyperplane.
● A straight line can be drawn to separate all the data
examples belonging to class +1 from all the
examples belonging to the class -1. Then the
two-dimensional data are clearly linearly separable.
● An infinite number of straight lines can be drawn to
separate the class +1 from the class -1.
Perceptron for Linearly Separable Data

Challenge: How to learn these n parameters w1 … wn ?


Solution: Use Perceptron learning algorithm.
Perceptron and its Learning - Review
● Data
○ Truth tables or set of examples
● Model
○ Perceptron
● Objective Function
○ deviation of desired output d(x) and the computed output o(x)
● Learning Algorithm
○ Perceptron Learning algorithm
Example of Linearly Separable data
Non-Linearly Separable Data
Non-linearly Separable Data
● Two groups of data points are non-linearly separable in a
2-dimensional space if they cannot be separated by a straight line.
Perceptron for XOR Gate
Question:
● How to represent XOR gate using a perceptron?
● What are the parameters for the XOR perceptron?
● Data is given below.
● Data is non-linearly separable.
Solution for XOR
● Qn: How to represent XOR gate using a perceptron?
● Ans: Use Multilayer Perceptron (MLP)
● Introduce another layer in between the input and output.
● This in-between layer is called hidden layer.
MLP for XOR

MLP works :)

● Input (0,0)
○ First neuron = 0*1 + 0*1 = 0 < 1 (threshold). So o/p = 0.
○ Second neuron = 0*(-1) + 0*(-1) = 0 ≥ -1 (threshold). So o/p = 1.
○ Third neuron = 0*1 + 1*1 = 1 < 2 (threshold). So o/p = 0. The desired output.
● Input (1,0)
○ First neuron = 1*1 + 0*1 = 1 ≥ 1. So o/p = 1.
○ Second neuron = 1*(-1) + 0*(-1) = -1 ≥ -1. So o/p = 1.
○ Third neuron = 1*1 + 1*1 = 2 ≥ 2. So o/p = 1. The desired output.
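A small Python check of this network, assuming the thresholds read off above (1 for the first hidden neuron, -1 for the second, and 2 for the output neuron):

def fires(s, th):
    # Outputs 1 when the weighted sum reaches the threshold
    return 1 if s >= th else 0

def xor_mlp(x1, x2):
    h1 = fires(1*x1 + 1*x2, 1)      # OR-like hidden unit
    h2 = fires(-1*x1 - 1*x2, -1)    # NAND-like hidden unit
    return fires(1*h1 + 1*h2, 2)    # AND of the two hidden units

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), xor_mlp(x1, x2))   # prints 0, 1, 1, 0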
Solution of XOR Data
● Data
○ Truth table
● Model
Multi-layered Perceptron.
● Challenge
How to learn the parameters and threshold?
● Solution for learning
Use gradient descent algorithm
Gradient Descent Algorithm
Numerical Example
Equation is y = (x+5)²
When will it be minimum?
Use gradient descent algorithm
Assume starting point as 3 and
Learning rate as 0.01.
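A short Python sketch of this example. The minimum of y = (x + 5)² is at x = -5, and gradient descent from the starting point 3 with learning rate 0.01 approaches it:

x, eta = 3.0, 0.01
for i in range(1000):
    grad = 2 * (x + 5)    # dy/dx of y = (x + 5)^2
    x -= eta * grad       # step against the gradient
print(x)                  # close to -5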
Gradient Descent Algorithm
Contour or hyperplane
Incremental (Stochastic) Gradient Descent

Incremental Gradient Descent can approximate Batch Gradient Descent
arbitrarily closely if η is made small enough.
MultiLayer Perceptron (MLP)
MultiLayer Perceptron (MLP)
Requires
1. Forward Pass through each layer to compute the output.
2. Compute the deviation or error between the desired output and
computed output in the forward pass (first step). This morphs into
objective function, as we want to minimize this deviation or error.
3. The deviation has to be sent back through each layer to compute the
delta or change in the parameter values. This is achieved using back
propagation algorithm.
4. Update the parameters.
Perceptron for XNOR Gate
XNOR
Ref:
Chapter 4 of Machine Learning by Tom M. Mitchell
Next Session:
How to Train MLP?
Power of MLP
Deep Learning
DSE Module 2
Seetha Parameswaran
BITS Pilani

Single Perceptron for Regression
Real Valued Output

Linear Regression Example
● Suppose that we wish to estimate the prices of houses (in dollars)
based on their area (in square feet) and age (in years).
● The linearity assumption just says that the target (price) can be
expressed as a weighted sum of the features (area and age):
price = warea * area + wage * age + b
● warea and wage are called weights, and b is called a bias.
● The weights determine the influence of each feature on our prediction
and the bias just says what value the predicted price should take
when all of the features take value 0.

Data
● Data
○ The dataset is called a training dataset or training set.
○ Each row is called an example (or data point, data instance, sample).
○ The thing we are trying to predict is called a label (or target).
○ The independent variables upon which the predictions are based are called
features (or covariates).

Affine transformations and Linear Models
● An equation of the form ŷ = w1 x1 + · · · + wd xd + b = w⊤x + b
is an affine transformation of the input features, which is characterized
by a linear transformation of features via a weighted sum, combined
with a translation via the added bias.
● Models whose output prediction is determined by the affine
transformation of input features are linear models.
● The affine transformation is specified by the chosen weights (w) and
bias (b).

Loss Function
● Loss function is a quality measure for some given model or a measure
of fitness.
● The loss function quantifies the distance between the real and
predicted value of the target.
● The loss will usually be a non-negative number where smaller values
are better and perfect predictions incur a loss of 0.
● The most popular loss function in regression problems is the squared
error.

Squared Error Loss Function
● The most popular loss function in regression problems is the squared
error.
● For each example,
● For the entire dataset of n examples
○ average (or equivalently, sum) the losses

● When training the model, find parameters (w∗, b∗ ) that minimize the
total loss across all training examples
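Written out (a reconstruction following the textbook's notation), the per-example loss, the dataset loss, and the training objective are:

l(i)(w, b) = (1/2) (ŷ(i) − y(i))²
L(w, b) = (1/n) Σi l(i)(w, b)
(w∗, b∗) = argmin(w,b) L(w, b)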

Minibatch Stochastic Gradient Descent (SGD)
● Apply Gradient descent algorithm on a random minibatch of examples
every time we need to compute the update.
● In each iteration,
○ Step 1: randomly sample a minibatch B consisting of a fixed number of training
examples.
○ Step 2: compute the derivative (gradient) of the average loss on the minibatch with
regard to the model parameters.
○ Step 3: multiply the gradient by a predetermined positive value η and subtract the
resulting term from the current parameter values.
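A minimal sketch of one such iteration, where grad_fn is a hypothetical function returning the gradient of the average loss on the given batch with respect to the parameters:

import numpy as np

def minibatch_sgd_step(w, X, y, grad_fn, batch_size, eta):
    # Step 1: randomly sample a minibatch B of fixed size
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    # Step 2: gradient of the average loss on the minibatch
    g = grad_fn(w, X[idx], y[idx])
    # Step 3: multiply by eta and subtract from the current parameters
    return w - eta * g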

Training using SGD Algorithm

PS: The number of epochs and the learning rate are both hyperparameters. Setting
hyperparameters requires some adjustment by trial and error.

Prediction
● Estimating targets given features is commonly called prediction or
inference.
● Given the learned model, values of target can be predicted, for any
set of features.

Single-Layer Neural Network
Linear regression is a single-layer neural network

○ Number of inputs (or feature dimensionality) in the input layer is d.


○ The inputs are x1 , . . . , xd.
○ The output is o1.
○ Number of outputs in the output layer is 1.
○ Number of layers for the neural network is 1. (conventionally we do not consider the input
layer when counting layers.)
○ Every input is connected to every output, This transformation is a fully-connected layer
or dense layer.

Multiple Perceptrons for Classification
Binary Outputs

Classification Example
● Each input consists of a 2 × 2 grayscale image.
● Represent each pixel value with a single scalar, giving four features
x1 , x2, x3 , x4.
● Assume that each image belongs to one among the categories
“square”, “triangle”, and “circle”.
● How to represent the labels?
○ Use label encoding. y ∈ {1, 2, 3}, where the integers represent {circle, square,
triangle} respectively.
○ Use one-hot encoding. y ∈ {(1, 0, 0), (0, 1, 0), (0, 0, 1)}.
■ y would be a three-dimensional vector, with (1, 0, 0) corresponding to “circle”,
(0, 1, 0) to “square”, and (0, 0, 1) to “triangle”.

Network Architecture
● A model with multiple outputs, one per class. Each output will
correspond to its own affine function.
○ 4 features and 3 possible output categories

Network Architecture
○ 12 scalars to represent the weights and 3 scalars to represent the biases.
○ Compute three logits, o1, o2, and o3, for each input.
○ The weights form a 3×4 matrix and the biases a 1×3 vector.

Softmax Operation
● Interpret the outputs of our model as probabilities.
○ Any output ŷj is interpreted as the probability that a given item belongs to class j.
Then choose the class with the largest output value as our prediction, argmax_j ŷj.
○ If ŷ1 , ŷ2 , and ŷ3 are 0.1, 0.8, and 0.1, respectively, then predict category 2.
○ To interpret the outputs as probabilities, we must guarantee that, they will be
nonnegative and sum up to 1.
● The softmax function transforms the outputs such that they become
nonnegative and sum to 1, while requiring that the model remains
differentiable.

○ first exponentiate each logit (ensuring non-negativity) and then divide by their sum
(ensuring that they sum to 1)
● Softmax is a nonlinear function.
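A minimal NumPy sketch of the softmax operation; subtracting the maximum logit first is a standard numerical-stability trick and does not change the result:

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)    # shift for numerical stability
    exp_z = np.exp(z)              # exponentiate: ensures non-negativity
    return exp_z / exp_z.sum()     # normalize: ensures the outputs sum to 1

print(softmax(np.array([1.0, 3.0, 1.0])))   # ≈ [0.106, 0.787, 0.106]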

Log-Likelihood Loss Function / Cross-Entropy loss
● The softmax function gives us a vector ŷ, which we can interpret as
estimated conditional probabilities of each class given any input x,
○ ŷ1 = P (y = cat | x).
● Compare the estimates with reality by checking how probable the
actual classes are according to our model, given the features:

● Maximizing P (Y | X) is equivalent to minimizing the negative log-likelihood.
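A sketch of this loss for a single example with a one-hot label, continuing the softmax example above:

import numpy as np

def cross_entropy(y_hat, y):
    # y is one-hot, so only the log-probability of the true class survives
    return -np.sum(y * np.log(y_hat))

y_hat = np.array([0.1, 0.8, 0.1])    # softmax output
y = np.array([0.0, 1.0, 0.0])        # true class is the second one
print(cross_entropy(y_hat, y))       # -log(0.8) ≈ 0.223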

Multi Layered Perceptrons (MLP)

Multilayer Perceptron
● With deep neural networks, use the data to jointly learn both a
representation via hidden layers and a linear predictor that acts upon
that representation.
● Add many hidden layers by stacking many fully-connected layers on
top of each other. Each layer feeds into the layer above it, until we
generate outputs.
● The first (L−1) layers learn the representation and the final layer is
the linear predictor. This architecture is commonly called a multilayer
perceptron (MLP).

MLP Architecture

○ MLP has 4 inputs, 3 outputs, and its hidden layer contains 5 hidden units.
○ Number of layers in this MLP is 2.
○ The layers are both fully connected. Every input influences every neuron in the
hidden layer, and each of these in turn influences every neuron in the output layer.
○ The outputs of the hidden layer are called hidden representations,
hidden-layer variables, or hidden variables.

Nonlinearity in MLP

Activation Functions

Activation function
● Activation function of a neuron defines the output of that neuron given
an input or set of inputs.
● Activation functions decide whether a neuron should be activated or
not by calculating the weighted sum and adding bias with it.
● They are differentiable operators to transform input signals to outputs,
while most of them add non-linearity.
● Artificial neural networks are designed as universal function
approximators, so they must have the ability to calculate and learn any
nonlinear function.

1. Step Function

2. Sigmoid (Logistic) Activation Function

It is called a squashing function: it squashes any input in the range (−inf, inf) to some value
in the range (0, 1).
3. Tanh (hyperbolic tangent) Activation Function

● More efficient because it has a wider output range, which allows faster learning.
● The tanh activation usually works better than sigmoid activation function for
hidden units because the mean of its output is closer to zero, and so it
centers the data better for the next layer.
● Issues with tanh
○ computationally expensive
○ lead to vanishing gradients
4. ReLU Activation Function

● If we combine two ReLU units, we can recover a piecewise linear approximation of the Sigmoid function.
● Some ReLU variants: Softplus (Smooth ReLU), Noisy ReLU, Leaky ReLU, Parametric ReLU and Exponential ReLU
(ELU).
● Advantages
○ Fast Learning and Efficient computation
○ Fewer vanishing gradient problems
○ Sparse activation
○ Scale invariant (max operation)
● Disadvantages
○ Leads to exploding gradient.
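The activation functions discussed above, sketched in NumPy for comparison:

import numpy as np

def step(x):
    return np.where(x > 0, 1.0, 0.0)    # hard threshold

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # squashes to (0, 1)

def relu(x):
    return np.maximum(0.0, x)           # 0 for x < 0, identity for x >= 0

x = np.array([-2.0, 0.0, 2.0])
print(step(x), sigmoid(x), np.tanh(x), relu(x))   # tanh squashes to (-1, 1)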
Comparing Activation Functions

Training MLP

Two Layer Neural Network

Compute the Activations

Vectoring Forward Propagation

Neural Network Training – Forward Pass



Forward Propagation Algorithm

Computation Graph for Forward Pass

Cost Function

Neural Network Training – Backward Pass



Computing the Gradients

Backpropagation Algorithm to compute Gradients

Computation Graph for BackProp

Neural Network Training – Update Parameters

Parameter Updation

Training of MultiLayer Perceptron (MLP)
Requires
1. Forward Pass through each layer to compute the output.
2. Compute the deviation or error between the desired output and
computed output in the forward pass (first step). This morphs into
objective function, as we want to minimize this deviation or error.
3. The deviation has to be sent back through each layer to compute the
delta or change in the parameter values. This is achieved using back
propagation algorithm.
4. Update the parameters.

Scaling up for L layers in MLP

Forward Propagation Algorithm

Backward Propagation Algorithm

Update the Parameters

Ref:
Chapter 3 and 4 of T1

Next Session:
Power of MLP

Deep Learning
DSE Module 2
Seetha Parameswaran
BITS Pilani

Complex Boundaries Using MLP

Composing Complex Decision Boundaries

Boolean over the Reals






More Complex Boundaries


Example

Question
(x1, x2) are input features and target classes are
either +1 or -1 as shown in the figure.
A. What is the minimum number of hidden layers and
hidden nodes required to classify the following
dataset with 100% accuracy using a fully connected
multilayer perceptron network? Step activation
functions are used at all nodes, i.e., output=+1 if total
weighted input >= bias b at a node, else output = -1.
B. Show the minimal network architecture by
organizing the nodes in each layer horizontally. Show
the node representing x1 at the left on the input layer.
Organize the hidden nodes in ascending order of
bias at that node. Specify all weights and bias values
at all nodes. Weights can be only -2.5, 2.5 or 0, and
bias +ve/-ve multiples of 2.5.
Solution
A. 2 hidden layers, 4 nodes in first hidden layer and 2 nodes in second
hidden layer needed.
B.

MLP as Universal Boolean Functions

Multi Layered Perceptrons for Boolean Functions

How many layers for a Boolean MLP?










Reducing a Boolean Function



Largest irreducible DNF?


Width of a Single Layer Boolean MLP



Width of deep MLP

MLP XOR




Width of Single Layer Boolean MLP

A better representation…

Challenge of Depth

Need of Depth

Network Size

Question
How many perceptrons are required to represent W ⊕ X ⊕ Y ⊕ Z ?

Solution

Solution

Computational Graph

Computational Graph: Example

Computational Graph for Logistic Regression

Computational Graph for Back Propagation

















Question 8

Solution for Q8


Question 7
Draw the computational graph for Sigmoid function and show the gradient
computation also. Use generic equations.

Demo of DNN
1. XOR Implementation
https://colab.research.google.com/drive/1xVVpeU3q4bIOexV0J3NhYbLVCHwaOl6R#scrollTo=GRaiuHtKI1Sq

2. DNN using TensorFlow
https://colab.research.google.com/drive/1307HTGrHmMMQaiBlNm10vWHQcUSNUDTR#scrollTo=cPZ81DJmK-W3

Example with ReLU


Question

Solution

Question

Ref:
● http://mlsp.cs.cmu.edu/people/rsingh/docs/Chapter1_Introduction.pdf
● http://mlsp.cs.cmu.edu/people/rsingh/docs/Chapter2_UniversalApproximators.pdf

Next Session:
Mod3: Optimization
Refresh: Calculus

Deep Neural Network
AIML Module 3
Seetha Parameswaran
BITS Pilani


Optimization

What we Learn….
3.1 Challenges in Neural Network Optimization – saddle points and
plateau
3.2 Non-convex optimization intuition
3.3 Overview of optimization algorithms
3.4 Momentum based algorithms
3.5 Algorithms with Adaptive Learning Rates

Optimization Algorithms

Optimization Algorithm
● Optimization algorithms train deep learning models.
● Optimization algorithms are the tools that allow us to
○ continue updating model parameters, and
○ minimize the value of the loss function, as evaluated on the
training set.
● In optimization, a loss function is often referred to as the objective
function of the optimization problem.
● By tradition and convention most optimization algorithms are concerned
with minimization.

Optimization

Why Optimization Algorithm?
● The performance of the optimization algorithm directly affects the
modelʼs training efficiency.
● Understanding the principles of different optimization algorithms and the
role of their hyperparameters will enable us to tune the
hyperparameters in a targeted manner to improve the performance of
deep learning models.
● The goal of optimization is to reduce the training error. The goal of
deep learning is to reduce the generalization error, which also requires
reducing overfitting.

Optimization Challenges in Deep Learning
● Local minima
● Saddle points
● Vanishing gradients

Local Minima
● For any objective function f (x), if
the value of f (x) at x is smaller
than the values of f (x) at any other
points in the vicinity of x, then f (x)
could be a local minimum.
● If the value of f (x) at x is the
minimum of the objective function
over the entire domain, then f (x) is
the global minimum.
● In minibatch stochastic gradient
descent, the natural variation of
gradients over minibatches is able
to dislodge the parameters from
local minima.
Finding Minimum of a Function


Derivatives of a Function

Saddle points
● A saddle point is any location where
all gradients of a function vanish
but which is neither a global nor a
local minimum.
● Eg: f (x, y) = x^2 − y^2
○ saddle point at (0, 0)
○ Maximum wrt y
○ Minimum wrt x

Derivatives at Saddle (Inflection) Point


Functions of Multiple Variables

Gradient of a Scalar Function

Gradient of a Scalar Function with multiple variables

Hessian


Solution of Unconstrained Minimization
1. Solve for x where the gradient equals zero:
∇f(x) = 0
2. Compute the Hessian matrix ∇²f(x) at the candidate solution and verify
that
○ Local Minimum
■ Eigenvalues of the Hessian matrix are all positive
○ Local Maximum
■ Eigenvalues of the Hessian matrix are all negative
○ Saddle Point
■ Eigenvalues of the Hessian matrix at the zero-gradient position are
both negative and positive
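A quick NumPy check of this recipe on the saddle example f(x, y) = x² − y²: its Hessian at the zero-gradient point (0, 0) has one positive and one negative eigenvalue:

import numpy as np

# Hessian of f(x, y) = x^2 - y^2 (constant, so also its value at (0, 0))
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(np.linalg.eigvalsh(H))   # [-2.  2.]: mixed signs -> saddle point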

Example



Vanishing gradients
● Function f(x) = tanh(x)
● f′(x) = 1 − tanh²(x)
○ f′(4) = 0.0013
● The gradient of f is close to zero for large inputs.
● Vanishing gradients can
cause optimization to stall.
○ Reparameterization of
the problem helps.
○ So does good initialization
of the parameters.

Gradient Descent

How to find Global Minima?

Find Global Minima Iteratively

Approach of Gradient Descent


Approach of Gradient Descent

Update rule: new = old - eta * grad
Approach of Gradient Descent
● First-order gradient descent algorithms use the first-order derivatives
to get the value (magnitude) and direction of the update.

def gd(eta, f_grad):
    x = 10.0                      # starting point
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x)      # step against the gradient
        results.append(float(x))
    return results

f_grad = lambda x: 2 * x          # gradient of f(x) = x**2
gd(eta=0.2, f_grad=f_grad)
Effect of Learning Rate on Gradient Descent

Learning Rate
● The role of the learning rate is to moderate the degree to which
weights are changed at each step.
● Learning rate η is set by the algorithm designer.
● If the learning rate that is too small, it will cause x to update very
slowly, requiring more iterations to get a better solution.
● If the learning rate that is too large, the solution oscillates and in the
worst case it might even diverge.

Learning rate and Gradient Descent

gd(eta = 0.05, f_grad): slow learning. If we pick eta too small, we make
little progress.
gd(eta = 1.1, f_grad): oscillations. If we pick eta too large, the solution
oscillates and in the worst case it might diverge.
Stochastic Gradient Descent

Stochastic Gradient Descent
● In deep learning, the objective function is the average of the loss functions
for each example in the training dataset.
● Given a training dataset of n examples, let fi(x) be the loss function with
respect to the training example of index i, where x is the parameter vector.
● The objective function is f(x) = (1/n) Σi fi(x).
● The gradient of the objective function at x is ∇f(x) = (1/n) Σi ∇fi(x).
● Stochastic gradient descent samples an index i uniformly at random and
updates x as x ← x − η ∇fi(x).
● Computational cost of each iteration is O(1).


SGD with Constant Learning Rate
● The trajectory of the variables in stochastic gradient descent is much
noisier. This is due to the stochastic nature of the gradient: even
near the minimum, uncertainty is injected by the instantaneous
gradient via η∇fi(x).

import numpy as np

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate the noise that sampling injects into the gradient
    g1 += np.random.normal(0.0, 1, (1,))
    g2 += np.random.normal(0.0, 1, (1,))
    eta_t = eta * lr()   # eta (base rate) and lr() (schedule) defined globally
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)

Dynamic Learning Rate

Dynamic Learning Rate
● Replace η with a time-dependent learning rate η(t)
○ adds to the complexity of controlling convergence of an optimization algorithm.
● A few basic strategies that adjust η over time.

1. Piecewise constant
a. Decrease the learning rate, e.g., whenever progress in optimization stalls.
b. This is a common strategy for training deep networks.
2. Exponential decay
a. Leads to premature stopping before the algorithm has converged.
3. Polynomial decay with α = 0.5.
Exponential Decay
import math

t = 0   # global step counter
def exponential_lr():
    global t
    t += 1
    return math.exp(-0.1 * t)

● Variance in the parameters is significantly reduced.
● However, the algorithm fails to converge at all.

Polynomial decay
t = 0   # global step counter
def polynomial_lr():
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

● Use a polynomial decay


○ Learning rate decays with the inverse square root of the number of steps
○ Convergence gets better after only 50 steps.

Review
● Gradient descent
○ Uses the full dataset to compute gradients and to update parameters, one pass
at a time.
○ Gradient Descent is not particularly data efficient whenever data is very similar.
● Stochastic Gradient descent
○ Processes one observation at a time to make progress.
○ Stochastic Gradient Descent is not particularly computationally efficient since
CPUs and GPUs cannot exploit the full power of vectorization.
○ For noisy gradients, choice of the learning rate is critical.
■ If we decrease it too rapidly, convergence stalls.
■ If we are too lenient, we fail to converge to a good enough solution since
noise keeps on driving us away from optimality.
● Minibatch SGD
○ Accelerates computation, i.e., better computational efficiency.
○ Averaging gradients reduces the amount of variance.

Minibatch Stochastic Gradient Descent

Minibatch Stochastic Gradient Descent
● In each iteration, we first randomly sample a minibatch B consisting of
a fixed number of training examples.
● We then compute the derivative (gradient) of the average loss on the
minibatch with regard to the model parameters.
● Finally, we multiply the gradient by a predetermined positive value η
and subtract the resulting term from the current parameter values.

● |B| represents the number of examples in each minibatch (the batch size)
and η denotes the learning rate.

Minibatch Stochastic Gradient Descent Algorithm
● The gradient at time t is calculated as
gt = (1/|B|) Σi∈Bt gi,t−1
● |B| represents the number of examples in each minibatch (the batch size)
and η denotes the learning rate.
● gi,t−1 is the stochastic gradient for sample i,
computed using the weights updated at time t − 1.

SGD Algorithm

Momentum

Leaky Average in Minibatch SGD
● Replace the gradient computation by a "leaky average" for better
variance reduction, with β ∈ (0, 1):
vt = β vt−1 + gt,  xt = xt−1 − η vt
● This effectively replaces the instantaneous gradient by one that has
been averaged over multiple past gradients.
● v is called momentum. It accumulates past gradients.
● Large β amounts to a long-range average, and small β amounts to
only a slight correction relative to a gradient method.
● The new gradient replacement no longer points in the direction of
steepest descent on a particular instance, but rather in the
direction of a weighted average of past gradients.
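A minimal sketch of this update for a single parameter vector, assuming w, v, and grad are NumPy arrays (or scalars); v carries the leaky average across steps:

def momentum_step(w, v, grad, eta, beta):
    v = beta * v + grad    # leaky average ("momentum") of past gradients
    w = w - eta * v        # step along the averaged direction, not the raw gradient
    return w, v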
Momentum Algorithm

Momentum Method Example
● Consider a moderately distorted ellipsoid objective f(x) = 0.1 x1² + 2 x2².
● f has its minimum at (0, 0). This function is very flat in the x1 direction.

● For eta = 0.4, without momentum: the gradient in the x2 direction
oscillates far more than in the horizontal x1 direction.
● For eta = 0.6, without momentum: convergence in the x1 direction
improves, but the overall solution quality diverges.
Momentum Method Example
● Consider a moderately distorted ellipsoid objective
● Apply momentum for eta = 0.6

● For beta = 0.5: converges well, with fewer oscillations and larger steps
in the x1 direction.
● For beta = 0.25: reduced convergence, with more oscillations of larger
magnitude.
Momentum method: Summary
● Momentum replaces gradients with a leaky average over past
gradients. This accelerates convergence significantly.
● Momentum prevents stalling of the optimization process that is much
more likely to occur for stochastic gradient descent.
● The effective number of gradients aggregated is given by 1/(1 − β), due to
exponential downweighting of past data.
● Implementation is quite straightforward but it requires us to store an
additional state vector (momentum v).

Adagrad

Adagrad
● Used for features that occur infrequently (sparse features)
● Adagrad uses the aggregate of the squares of previously observed
gradients.

Adagrad Algorithm
● A state variable st accumulates past gradient variance:
st = st−1 + gt²,  xt = xt−1 − (η / √(st + ε)) · gt
● Operations are applied coordinate-wise: √(1/v) has entries √(1/vi),
and u · v has entries ui vi. η is the learning rate and ε is an additive
constant that ensures that we do not divide by 0.
● Initialize s0 = 0.
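A per-coordinate sketch of this update in NumPy (array operations are applied coordinate-wise, matching the description above):

import numpy as np

def adagrad_step(w, s, grad, eta, eps=1e-6):
    s = s + grad ** 2                        # accumulate squared gradients
    w = w - (eta / np.sqrt(s + eps)) * grad  # coordinates with large past gradients take smaller steps
    return w, s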

Adagrad: Summary
● Adagrad decreases the learning rate dynamically on a per-coordinate
basis.
● It uses the magnitude of the gradient as a means of adjusting how
quickly progress is achieved - coordinates with large gradients are
compensated with a smaller learning rate.
● If the optimization problem has a rather uneven structure Adagrad can
help mitigate the distortion.
● Adagrad is particularly effective for sparse features where the learning
rate needs to decrease more slowly for infrequently occurring terms.
● On deep learning problems Adagrad can sometimes be too
aggressive in reducing learning rates.
RMSProp

RMSProp
● Adagrad uses a learning rate that decreases at a predefined schedule of
effectively O(t^(−1/2)).
● The RMSProp algorithm decouples rate scheduling from
coordinate-adaptive learning rates. This is essential for non-convex
optimization.

RMSProp Algorithm
● Use a leaky average to accumulate past gradient variance:
st = γ st−1 + (1 − γ) gt²,  xt = xt−1 − (η / √(st + ε)) · gt
● Parameter γ (gamma) > 0.
● The constant ε > 0 is set to 10^(−6) to ensure that we do not suffer from
division by zero or overly large step sizes.

RMSProp Algorithm

RMSProp Example
import math

eta, gamma = 0.4, 0.9   # learning rate and decay hyperparameters

def rmsprop_2d(x1, x2, s1, s2):
    # Gradients of f(x) = 0.1 * x1**2 + 2 * x2**2
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2   # leaky average of squared gradients
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1      # per-coordinate scaled step
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

RMSProp: Summary
● RMSProp is very similar to Adagrad as both use the square of the
gradient to scale coefficients.
● RMSProp shares with momentum the leaky averaging. However,
RMSProp uses the technique to adjust the coefficient-wise
preconditioner.
● The learning rate needs to be scheduled by the experimenter in
practice.
● The coefficient γ (gamma) determines how long the history is when
adjusting the per-coordinate scale.

Adam

Review of techniques learned so far
1. Stochastic gradient descent
○ more effective than Gradient Descent when solving optimization problems, e.g.,
due to its inherent resilience to redundant data.
2. Minibatch Stochastic gradient descent
○ affords significant additional efficiency arising from vectorization, using larger sets
of observations in one minibatch. This is the key to efficient multi-machine,
multi-GPU and overall parallel processing.
3. Momentum
○ added a mechanism for aggregating a history of past gradients to accelerate
convergence.
4. Adagrad
○ used per-coordinate scaling to allow for a computationally efficient preconditioner.
5. RMSProp
○ decoupled per-coordinate scaling from a learning rate adjustment.

Adam
● Adam combines all these techniques into one efficient learning
algorithm.
● It is one of the more robust and effective optimization algorithms used in
deep learning.
● Adam can diverge due to poor variance control. (disadvantage)
● Adam uses exponential weighted moving averages (also known as
leaky averaging) to obtain an estimate of both the momentum and
also the second moment of the gradient.

Adam Algorithm
● State variables (leaky averages of the gradient and its square):
vt = β1 vt−1 + (1 − β1) gt,  st = β2 st−1 + (1 − β2) gt²
● β1 and β2 are nonnegative weighting parameters. Common choices
for them are β1 = 0.9 and β2 = 0.999. That is, the variance estimate
moves much more slowly than the momentum term.
● Initialize v0 = s0 = 0.
● Normalize (bias-correct) the state variables:
v̂t = vt / (1 − β1^t),  ŝt = st / (1 − β2^t)
Adam Algorithm
● Rescale the gradient: g′t = η v̂t / (√ŝt + ε)
● Compute updates: xt = xt−1 − g′t
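Putting the pieces together, a minimal sketch of one Adam step; the learning rate eta here is a placeholder value, and t is the 1-based step count used for bias correction:

import numpy as np

def adam_step(w, v, s, grad, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-6):
    v = beta1 * v + (1 - beta1) * grad              # momentum (first-moment) estimate
    s = beta2 * s + (1 - beta2) * grad ** 2         # second-moment estimate
    v_hat = v / (1 - beta1 ** t)                    # bias correction for the slow start
    s_hat = s / (1 - beta2 ** t)
    g_prime = eta * v_hat / (np.sqrt(s_hat) + eps)  # rescaled gradient
    return w - g_prime, v, s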

Adam Algorithm

Adam: Summary
● Adam combines features of many optimization algorithms into a fairly
robust update rule.
● Adam uses bias correction to adjust for a slow startup when
estimating momentum and a second moment.
● For gradients with significant variance we may encounter issues with
convergence. They can be amended by using larger minibatches or
by switching to an improved estimate for state variables. Yogi
algorithm offers such an alternative.

Learning Rate: Summary
1. Adjusting the learning rate is often just as important as the actual
algorithm.
2. The magnitude of the learning rate matters. If it is too large, optimization
diverges; if it is too small, it takes too long to train, or we end up with a
suboptimal result. The momentum algorithm helps here.
3. The rate of decay is just as important. If the learning rate remains
large, we may simply end up bouncing around the minimum and thus
not reach optimality. We want the rate to decay, but probably more
slowly than O(t^(−1/2)).
4. Initialization pertains both to how the parameters are set initially and
also how they evolve initially. This is known as warmup, i.e., how
rapidly we start moving towards the solution initially.
Numerical Problems

Question with Solution

Question

1. Compute the value of (w1, w2) that minimizes the error. Compute the
minimum possible value of the error.
2. What will be value of (w1 , w2 ) at time (t + 1) if standard gradient descent is
used?
3. What will be value of (w1 , w2 ) at time (t + 1) if momentum is used?
4. What will be value of (w1 , w2 ) at time (t + 1) if RMSPRop is used?
5. What will be value of (w1 , w2 ) at time (t + 1) if Adam is used?

Solution



Ref TB Dive into Deep Learning
● Chapter 12 (online version)

Next Session:
Regularization

Deep Neural Network
AIML Module 5
Seetha Parameswaran
BITS Pilani


Regularization Techniques

What we Learn….
4.1 Model Selection
4.2 Underfitting, and Overfitting
4.3 L1 and L2 Regularization
4.4 Dropout
4.5 Challenges - Vanishing and Exploding Gradients, Covariance shift
4.6 Parameter Initialization
4.7 Batch Normalization

Generalization in DNN

Generalization
● Goal is to discover patterns that generalize.
○ The goal is to discover patterns that capture regularities in the
underlying population from which our training set was drawn.
○ Models are trained on a sample of data.
○ When working with finite samples, we run the risk that we might
discover apparent associations that turn out not to hold up when we
collect more data or on newer samples.
● The trained model should predict for newer or unseen data. This problem
is called generalization.

Training Error and Generalization Error
● Training error is the error of our model as calculated on the training
dataset.
○ Obtained while training the model.
● Generalization error is the expectation of our modelʼs error if an
infinite stream of additional data examples, drawn from the same
underlying data distribution as the original sample, were applied to the
model.
○ Cannot be computed, but estimated.
○ Estimate the generalization error by applying the model to an independent test set,
constituted of a random selection of data examples that were withheld from the
training set.

Model Complexity
● Simple models and abundant data
○ Expect the generalization error to resemble the training error.
● More complex models and fewer examples
○ Expect the training error to go down but the generalization gap to grow.
● Model complexity
○ A model with more parameters might be considered more complex.
○ A model whose parameters can take a wider range of values might be more
complex.
○ A neural network model that takes more training iterations is more complex, and
one subject to early stopping (fewer training iterations) is less complex.

Factors that influence the generalizability of a model
1. The number of tunable parameters.
○ When the number of tunable parameters, called the degrees of freedom, is large,
models tend to be more susceptible to overfitting.
2. The values taken by the parameters.
○ When weights can take a wider range of values, models can be more susceptible
to overfitting.
3. The number of training examples.
○ It is trivially easy to overfit a dataset containing only one or two examples even if
your model is simple. But overfitting a dataset with millions of examples requires
an extremely flexible model.

Model Selection
● Model selection is the process of selecting the final model after
evaluating several candidate models.
● With MLPs, compare models with
○ different numbers of hidden layers,
○ different numbers of hidden units
○ different activation functions applied to each hidden layer.
● Use Validation dataset to determine the best among our candidate
models.

Validation dataset
● Never rely on the test data for model selection.
○ Risk of overfitting the test data
● Do not rely solely on the training data for model selection
○ We cannot estimate the generalization error on the very data that we use to train
the model.
● Split the data three ways, incorporating a validation dataset (or
validation set) in addition to the training and test datasets.
● In deep learning, with millions of data available, the split is generally
○ Training = 98-99 % of the original dataset
○ Validation = 1-2 % of training dataset
○ Testing = 1-2 % of the original dataset

Just Right Model
● High Training accuracy
● High Validation accuracy
● Low Bias and Low Variance
● Usually care more about the
validation error than about the gap
between the training and validation
errors.

Underfitting
● Low Training accuracy and Low Validation accuracy.

● Training error and validation error are both substantial, but there is little gap
between them.
● The model is too simple (insufficiently expressive) to capture the pattern that we
are trying to model.
● If generalization gap between our training and validation errors is small, a more
complex model may be better.

Overfitting
● The phenomenon of fitting the
training data more closely than fit
the underlying distribution is called
overfitting.
● High Training accuracy and Low
Validation accuracy
● Training error is significantly lower
than the validation error.
● The techniques used to combat
overfitting are called
regularization.

Underfitting or Overfitting?

Simple model → underfitting. Complex model → overfitting. Just-right model → neither.

Polynomial degree and underfitting vs. overfitting

Model complexity and dataset size
● With more data, we can fit a more complex model.
● With more data, the generalization error typically decreases.
● With less data, simpler models may be more difficult to beat.
● With less data, models are more likely (and more severely) to overfit.

The current success of deep learning owes to the abundance of massive
datasets from Internet companies, cheap storage, connected devices, and
the broad digitization of the economy.

Deep Learning Model Selection

Regularization

Regularization Techniques
● Weight Decay ( L2 regularization)
● Dropout
● Early Stopping

Weight Decay

L2 Regularization
● Measure the complexity of a linear function f(x) = w⊤x by some norm
of its weight vector, e.g., ∥w∥².
● Add the norm as a penalty term to the problem of minimizing the loss.
This will ensure that the weight vector is small.
● The objective function becomes minimizing the sum of the
prediction loss and the penalty term.
● L2-regularized linear models constitute the ridge regression algorithm.

L2 Regularization
● The trade-off between the standard loss and the additive penalty is given by
the regularization constant λ, a non-negative hyperparameter.

● For λ = 0, we recover the original loss function.


● For λ > 0, we restrict the size of ∥w∥.
● Smaller values of λ correspond to less constrained w, whereas larger
values of λ constrain w more considerably.
● By squaring the L2 norm, we remove the square root, leaving the sum of
squares of each component of the weight vector.
● A factor of 1/2 is included so that, when we differentiate the quadratic
penalty, the 2 and the 1/2 cancel out.
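A sketch of the L2-penalized objective for a linear model (the names here are illustrative), showing where λ and the factor of 1/2 enter:

import numpy as np

def l2_penalized_loss(w, X, y, lam):
    err = X @ w - y
    data_loss = 0.5 * np.mean(err ** 2)    # prediction loss
    penalty = (lam / 2) * np.sum(w ** 2)   # (lam/2) * ||w||^2; derivative is lam * w
    return data_loss + penalty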
L1 Regularization
● L1-regularized linear regression is known as lasso regression.
● L1 penalties lead to models that concentrate weights on a small set of
features by clearing the other weights to zero. This is called feature
selection.

24
L2 Regularization | L1 Regularization
Sum of square of weights | Sum of absolute value of weights
Learns complex data patterns | Generates simple and interpretable models
Estimates mean of data | Estimates median of data
Not robust to outliers | Robust to outliers
Shrinks coefficients equally | Shrinks coefficients to zero
Non-sparse solution | Sparse solution
One solution | Multiple solutions
No feature selection | Selects features
Useful for collinear features | Useful for dimensionality reduction
Dropout

Smoothness
● Classical generalization theory suggests that to close the gap
between train and test performance, we should aim for a simple model.
● Simplicity can take the form of
○ small weights, e.g., via weight decay
○ smoothness, i.e., the function should not be sensitive to small changes to its
inputs.
● Injecting noise enforces smoothness
○ training with input noise
○ injecting noise into each layer of the network before calculating the subsequent
layer during training.

27
Dropout
● Dropout involves injecting noise while computing each internal layer
during forward propagation.
● It has become a standard technique for training neural networks.
● The method is called dropout because we literally drop out some
neurons during training.
● Apply dropout to a hidden layer, zeroing out each hidden unit with
probability p.
● The calculation of the outputs no longer depends on the dropped-out
neurons, and their respective gradients also vanish during
backpropagation (in that iteration).
● Dropout is disabled at test time.
28
Without Dropout

29
Dropout for first iteration

30
Dropout for second iteration

31
Dropout
● Dropout effectively trains a smaller, thinned network at each
iteration, giving the effect of regularization.
● In general,
○ Vary keep probability (0.5 to 0.8) for each hidden layer.
○ The input layer has a keep probability of 1.0 or 0.9.
○ The output layer has a keep probability of 1.0.
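A minimal Keras sketch, with illustrative layer sizes. Note that Keras'
Dropout layer takes the drop rate, so a keep probability of 0.8 corresponds
to Dropout(0.2):

from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(3072,)))
model.add(layers.Dropout(0.2))  # keep probability 0.8 for this hidden layer
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.5))  # keep probability 0.5
model.add(layers.Dense(10, activation='softmax'))  # no dropout on the output

Keras activates dropout only during training (model.fit) and disables it
automatically at inference time (model.predict).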

32
Early Stopping

33
Before Early stopping
● When training large models, training error decreases steadily over
time, but validation set error begins to rise again.
● Training objective decreases consistently over time.
● Validation set average loss begins to increase again, forming an
asymmetric U-shaped curve.

34
Early stopping
● No longer looking for a local minimum of validation error, while
training.
● Train until the validation set error has not improved for some amount
of time.
● Every time the error on the validation set improves, store a copy of the
model parameters. When the training algorithm terminates, return
these parameters.

35
Early stopping
● Effective and simple form of regularization.
● Trains simpler models

36
Early Stopping code
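A minimal Keras sketch of the idea, assuming a compiled model and the data
splits used elsewhere in this deck, with an illustrative patience of 5 epochs:

from tensorflow.keras import callbacks

early_stop = callbacks.EarlyStopping(
    monitor='val_loss',            # watch the validation loss
    patience=5,                    # stop after 5 epochs with no improvement
    restore_best_weights=True)     # return the best parameters seen

model.fit(partial_x_train, partial_y_train,
          epochs=100, batch_size=512,
          validation_data=(x_val, y_val),
          callbacks=[early_stop])

restore_best_weights implements the recipe above: store a copy of the
parameters every time the validation error improves, and return that copy
when training terminates.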

37
Numerical Stability and Initialization

38
Why Initialization is important?
● The choice of initialization is crucial for maintaining numerical stability.
● The choices of initialization can be tied up in interesting ways with the
choice of the nonlinear activation function.
● Which function we choose and how we initialize parameters can
determine how quickly our optimization algorithm converges.
● Poor choices can cause us to encounter exploding or vanishing
gradients while training.

39
Vanishing and Exploding Gradients
● Consider a deep network with L layers, input x and output o, with each
layer l defined by a transformation fl parameterized by weights W(l),
whose hidden variable is h(l) = fl(h(l−1)) (taking h(0) = x).
● If all the hidden variables and the input are vectors, then the gradient
of o with respect to any set of parameters W(l) is

∂o/∂W(l) = ∂h(L)/∂h(L−1) · … · ∂h(l+1)/∂h(l) · ∂h(l)/∂W(l)
         = M(L) · … · M(l+1) · v(l)

● The gradient is the product of L − l matrices M(L) · … · M(l+1) and the
gradient vector v(l), where M(k) = ∂h(k)/∂h(k−1) and v(l) = ∂h(l)/∂W(l).
40
Vanishing and Exploding Gradients
● The matrices M(l) may have a wide variety of eigenvalues. They might
be small or large, and their product might be very large or very small.
● Gradients of unpredictable magnitude also threaten the stability of the
optimization algorithms.
● Parameter updates may be either
(i) excessively large, destroying our model (the exploding gradient
problem);
(ii) excessively small (the vanishing gradient problem), rendering
learning impossible as parameters hardly move on each update.

41
Vanishing Gradients
● Activation function sigmoid σ can cause the vanishing gradient
problem.
○ The sigmoidʼs gradient vanishes both when its inputs are large and when they are
small.
○ When backpropagating through many layers, where the inputs to many of the
sigmoids are close to zero, the gradients of the overall product may vanish.
● Solution: Use ReLU for hidden layers. ReLU is more stable.
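To see why, note that the sigmoid's derivative is σ′(x) = σ(x)·(1 − σ(x)),
which peaks at 0.25 (at x = 0) and decays toward 0 for large |x|.
Backpropagating through L sigmoid layers therefore multiplies the gradient
by L such factors, so the product can be as small as 4^(−L).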

42
Parameter Initialization
1. Default Initialization
○ Uses a normal distribution to initialize the values of the parameters.
2. Xavier Initialization
○ Samples weights from a Gaussian distribution with zero mean and variance
σ² = 2/(n_in + n_out), where n_in and n_out are the numbers of inputs and
outputs of the layer.
○ Now-standard and practically beneficial.
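A minimal Keras sketch (layer size illustrative; Keras' Dense layers default
to the closely related Glorot/Xavier uniform initializer):

from tensorflow.keras import layers

# Explicitly request Xavier (Glorot) normal initialization for the weights
layers.Dense(64, activation='relu', kernel_initializer='glorot_normal')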

43
Batch Normalization

44
Why Batch Normalization?
1. Standardize the input features to each have a mean of zero and
variance of one. This standardization puts the parameters a priori at a
similar scale. Better optimization.
2. In an MLP or CNN, as we train, the variables in intermediate layers may
take values with widely varying magnitudes: along the layers from input
to output, across units in the same layer, and over time due to our
updates to the model parameters. This drift in the distribution of such
variables could hamper the convergence of the network.
3. Deeper networks are complex and easily capable of overfitting. This
means that regularization becomes more critical.
45
Batch Normalization
● Batch normalization is a popular and effective technique that
consistently accelerates the convergence of deep networks.
● Batch normalization is applied to individual layers.
● It works as follows:
○ In each training iteration, first normalize the inputs (of batch normalization) by
subtracting their mean and dividing by their standard deviation, where both
are estimated based on the statistics of the current minibatch.
○ Next, apply a scale coefficient and a scale offset.
● It is from this normalization based on batch statistics that batch
normalization derives its name.
● Batch normalization works best for moderate minibatch sizes, in the
50 to 100 range.
46
Batch Normalization
● Denote by x ∈ B an input to batch normalization (BN) that is from a
minibatch B. Batch normalization transforms x as

BN(x) = γ ⊙ (x − μ̂B) / σ̂B + β

● μ̂B is the sample mean and σ̂B is the sample standard deviation of the
minibatch B; γ and β are the scale and shift parameters described next.
● After applying standardization, the resulting minibatch has zero mean
and unit variance.

47
Batch Normalization
● Denote by x ∈ B an input to batch normalization (BN) that is from a
minibatch B, batch normalization transforms x as

● μ̂B is the sample mean and σ̂B is the sample standard deviation of the
minibatch B.
● After applying standardization, the resulting minibatch has zero mean
and unit variance.

48
Batch Normalization
● Elementwise scale parameter γ and shift parameter β that have the
same shape as x. γ and β are parameters are learned jointly with the
other model parameters.
● Batch normalization actively centers and rescales the inputs to each
layer back to a given mean and size.
● Calculate μ̂B and σ̂B over the minibatch:

μ̂B = (1/|B|) · Σx∈B x
σ̂B² = (1/|B|) · Σx∈B (x − μ̂B)² + ε

where ε > 0 is a small constant added for numerical stability.
49
Batch Normalization Layers
● Batch normalization implementations for fully-connected layers and
convolutional layers are slightly different.
○ Fully-Connected Layers
■ Insert batch normalization after the affine transformation and before the
nonlinear activation function.
○ Convolutional Layers
■ Apply batch normalization after the convolution and before the nonlinear
activation function.
■ Carry out each batch normalization over the m ·p ·q elements per output
channel simultaneously.
● It operates on a full minibatch at a time.
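A minimal Keras sketch of this placement (sizes illustrative):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, input_shape=(3072,)),  # affine transformation
    layers.BatchNormalization(),            # normalize before the nonlinearity
    layers.Activation('relu'),
    layers.Dense(10, activation='softmax'),
])
# For convolutional layers the pattern is analogous:
# Conv2D(32, 3) -> BatchNormalization() -> Activation('relu')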

50
Batch Normalization During Prediction
● After training, use the entire dataset to compute stable estimates of
the variable statistics and then fix them at prediction time.

51
Numerical Problems
(discuss in Webinar)

52
Ref TB Dive into Deep Learning
● Sections 5.4, 5.5, 5.6 and 8.5 (online
version)

53
Next Session:
CNN

54
Deep Learning
DSE Review Session 8
Seetha Parameswaran
BITS Pilani
The author of this deck, Prof. Seetha Parameswaran,
is gratefully acknowledging the authors who made
their course materials freely available online.

2
Midsem Review Questions

3
Question
For a newsfeed classification system with 46 topics, the following two
models are tried. 10,000 unique words are used as input in both models.
20% of the training data is held out as the validation set (x_val, y_val);
the remaining portion is partial_x_train.

Network Model 1:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(partial_x_train, partial_y_train, epochs=9, batch_size=512, validation_data=(x_val, y_val))

Network Model 2:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=128, validation_data=(x_val, y_val))
4
Solution
What is the number of trainable parameters in the input-hidden layer and
hidden-output layer in Model 2? Which of the networks will give lower
validation error and why?

● # of trainable parameters in the input-hidden layer: 10000*64 + 64 = 640,064
● # of trainable parameters in the hidden-output layer: 4*46 + 46 = 230
● Model 1 will give the lower validation error: the 4-unit hidden layer in
Model 2 is too small for 46 classes, creating an information bottleneck that
loses information and reduces generalization ability.

5
Question
Hidden node and output node use, respectively, ReLU and
sigmoid activation functions. Bias values at hidden and output
nodes are zero. Weights for the current iteration are given in the
above figure. Target output d is specified as 0. Learning rate is 0.3.
A. Calculate the actual output y for the current iteration with
input (x1, x2) = (1, 1).
B. Calculate the binary cross-entropy error for the current
iteration.
C. Assuming L1 regularization constant = 0.2, calculate the new
w3 for the next iteration.
D. Assuming L2 regularization constant = 0.2, calculate the w1
for the next iteration.
E. Assuming both L1 (with regularization constant=0.2) and L2
(regularization constant=0.2) are applied, calculate the value of w5
in next iteration.
6
Question
A. The total input to the output node is z = 0, so y = 1/(1 + e⁰) = 1/(1+1) = 1/2.
B. Binary cross-entropy with target d = 0:
L = −[d·log y + (1−d)·log(1−y)] = −log(1 − 1/2) = log 2 ≈ 0.693.
C. Let z be the total input to the output node, so y = 1/(1+exp(−z)).
w3(t+1) = w3(t) − 0.3·dL/dw3 − 0.3·0.2·sign(w3)
= −0.94 − 0.3·(dL/dz)·(dz/dw3) + 0.06      (since sign(−0.94) = −1)
= −0.94 − 0.3·x1·(1/2)·(1/2) + 0.06
= −0.94 − 0.075 + 0.06 = −0.955
D. Let h1 be the output of the hidden node.
w1(t+1) = w1(t) − 0.3·(dL/dh1)·(dh1/dw1) − 0.3·0.2·w1
= 0.94 − 0 − 0.06·0.94 = 0.8836,
since dh1/dw1 = ReLU′(0) = 0 and only the weight-decay term remains.
E. w5(t+1) = w5(t)-0.3*dL/dz*dz/dw5-0.3*0.2*w5-0.3*0.2*sign(w5) = 0.88
7
Question
A. What is a saddle point? What is the advantage/disadvantage of using
Stochastic Gradient Descent in dealing with saddle points?
B. What are five strategies to prevent overfitting in deep networks, when
used for classification, say of images?
C. What is the difference between kernel regularizers and activity
regularizers?
D.

8
Answer
A. In a multi-dimensional surface, points where all the first partial
derivatives are 0 but some second partial derivatives are positive and
some are negative are known as saddle points. In other words, the
Hessian is indefinite. At saddle points, the surface has minima along
some directions and maxima along others. The noise in SGD's minibatch
gradient estimates can perturb the parameters off a saddle point, where
full-batch gradient descent may stall; the disadvantage is that the same
noise makes updates less stable near true minima.
B. L1/L2 regularization, dropout, data augmentation, early stopping,
adding noise to input/target output.
C. kernel_regularizer: Regularizer to apply a penalty on the layer's
kernel. activity_regularizer: Regularizer to apply a penalty on the
layer's output.

9
Question
Consider the following DNN for image classification for a dataset that
consists of RGB images of size 32x32.
model = models.Sequential()
# Layer 1
model.add(layers.Dense(50, activation='relu', input_shape=**A**))
# Layer 2
model.add(layers.Dense(40, activation='relu'))
# Layer 3
model.add(layers.Dense(30, activation='relu'))
# Layer 4
model.add(layers.Dense(**B**, activation=**C**))
model.compile(optimizer='sgd', loss=**D**, metrics=['accuracy'])

10
Question
A. What is the input shape **A** in Layer 1?
B. What will be the value of **B**, activation function **C** and loss **D**
if the total number of classes in the dataset is (a) 2 (b) 10
C. What will be the total number of parameters in Layer 1, Layer 2 and
Layer 3?
D. If a dropout layer of value 0.5 is added after Layer 2, what will be the
change in the number of parameters?

11
Answer
A. **A** is (32*32*3,) or (3072,)
B. (**B**, **C**, **D**)
(a) 2 is 1, sigmoid, binary_crossentropy
(b) 10 is 10, softmax, categorical_crossentropy
C. What will be the total number of parameters in Layer 1, Layer 2 and
Layer 3?
● Layer 1- 3072*50 + 50 = 153,650
● Layer 2- 50*40 + 40 = 2040
● Layer 3- 40*30 + 30 = 1230
● Total = 153650 + 2040 + 1230 = 156,920
D. No change in the number of parameters: a dropout layer has no trainable parameters.
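These counts can be verified with a quick sketch (assuming TensorFlow/Keras;
the summary should report 153,650, 2,040 and 1,230 parameters for the three
hidden Dense layers, and 0 for the dropout layer):

from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Dense(50, activation='relu', input_shape=(3072,)))
model.add(layers.Dense(40, activation='relu'))
model.add(layers.Dropout(0.5))  # contributes no trainable parameters
model.add(layers.Dense(30, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
model.summary()                 # prints per-layer parameter counts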
12
Question
A perceptron structure and the training data are given
below.
Assume the following weights and bias. w1 = 0.41, w2
= 0.23, w3 = 0.5 and b = 0.01.
(a) Compute the output y of the perceptron for the first
training example.
(b) Compute the error.
(c) Update the weights and bias.
(d) Using the updated parameters, compute the output
for the second training example.

13
Solution

14
Question

15
Solution

16
Do it yourself :)

(a) Compute the forward propagation and generate the output. Use Sigmoid activation function.

(b) Compute the RMSE loss.

(c) Let the given weights be at time (t-1). Compute the weights at time t using SGD.

(d) Compute the weights at (t + 1) using Momentum. Assume α = 1.1 and β = 0.8.
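For part (d), one common form of the momentum update is (assuming velocity v
initialized to 0, learning rate α, and momentum coefficient β; conventions
vary, e.g., some texts scale the gradient by (1 − β)):

v(t) = β·v(t−1) + g(t),  where g(t) = ∂L/∂w evaluated at w(t−1)
w(t) = w(t−1) − α·v(t)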
17
All the best :)

18
