Deep Learning Midsem Merged Previous Batch
AIML Module 1
Seetha Parameswaran
BITS Pilani
The author of this deck, Prof. Seetha Parameswaran, gratefully acknowledges the authors who made their course materials freely available online.
Course Logistics
What we Learn…. (Module Structure)
1. Fundamentals of Neural Network
2. Multilayer Perceptron
3. Deep Feedforward Neural Network
4. Improve the DNN performance by Optimization and Regularization
5. Convolutional Neural Networks
6. Sequence Models
7. Attention Mechanism
8. Neural Architecture Search
9. Time series Modelling and Forecasting
10. Other Learning Techniques
Text book
● Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. https://d2l.ai/chapter_introduction/index.html
Course Logistics
● Refer to Canvas for the following
○ Handout
○ Schedule for Webinars
○ Schedule of Quizzes and Assignments
○ Evaluation scheme
○ Session Slide Deck
○ Demo Lab Sheets
○ Quiz-I, Quiz-II
○ Assignment-I, Assignment-II
○ Sample QPs
● Lecture Recordings
○ Available on Microsoft Teams
Honour Code
All submissions for graded components must be the result of your original effort.
It is strictly prohibited to copy and paste verbatim from any sources, whether
online or from your peers. The use of unauthorized sources or materials, as well
as collusion or unauthorized collaboration to gain an unfair advantage, is also
strictly prohibited. Please note that we will not distinguish between the person
sharing their resources and the one receiving them for plagiarism, and the
consequences will apply to both parties equally.
https://colab.research.google.com/drive/1DUVcOoUIWhl8GQKc6AWR1wi0LaMeNkgD?usp=sharing
Students, please note:
The Python notebook is shared with anyone who has the link, and access is restricted to BITS email ids. So please do not access it from a non-BITS email id and send requests for access.
Exercise
1. Represent OR gate using Perceptron. Compute the parameters of the
perceptron using perceptron learning algorithm.
2. Represent AND gate using Perceptron. Compute the parameters of
the perceptron using perceptron learning algorithm.
Representational Power of Perceptrons
● A perceptron represents a hyperplane decision surface in the
n-dimensional space of examples.
● The perceptron outputs a 1 for examples lying on one side of the
hyperplane and outputs a -1 for examples lying on the other side.
Linearly Separable Data
● Two sets of data points in a two-dimensional space are said to be linearly separable when they can be completely separated by a single straight line.
● In general, two groups of data points are separable in an n-dimensional space if they can be separated by an (n−1)-dimensional hyperplane.
● A straight line can be drawn to separate all the data
examples belonging to class +1 from all the
examples belonging to the class -1. Then the
two-dimensional data are clearly linearly separable.
● An infinite number of straight lines can be drawn to
separate the class +1 from the class -1.
Perceptron for Linearly Separable Data
MLP works :)
(Thresholds assumed from the computations below: 1 for the first hidden neuron, −1 for the second, and 2 for the output neuron; a neuron outputs 1 when its weighted input is ≥ its threshold, else 0.)
● Input (0,0)
○ First neuron: 0*1 + 0*1 = 0 < 1 (threshold). So o/p = 0.
○ Second neuron: 0*(-1) + 0*(-1) = 0 ≥ −1 (threshold). So o/p = 1.
○ Third neuron: 0*1 + 1*1 = 1 < 2 (threshold). So o/p = 0. The desired output.
● Input (1,0)
○ First neuron: 1*1 + 0*1 = 1 = threshold. So o/p = 1.
○ Second neuron: 1*(-1) + 0*(-1) = −1 = threshold. So o/p = 1.
○ Third neuron: 1*1 + 1*1 = 2 ≥ 2 (threshold). So o/p = 1. The desired output.
Solution of XOR Data
● Data
○ Truth table
● Model
Multi-layered Perceptron.
● Challenge
How to learn the parameters and threshold?
● Solution for learning
Use gradient descent algorithm
Gradient Descent Algorithm
Numerical Example
The function is y = (x + 5)².
When is it minimum?
Use the gradient descent algorithm, assuming a starting point of x = 3 and a learning rate of 0.01. (A minimal sketch follows.)
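A minimal Python sketch of this worked example (the iteration count is illustrative):

# Minimize y = (x + 5)^2 by gradient descent; dy/dx = 2 * (x + 5).
x = 3.0          # starting point
eta = 0.01       # learning rate
for _ in range(1000):
    grad = 2 * (x + 5)
    x -= eta * grad
print(x)  # approaches the minimum at x = -5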
Gradient Descent Algorithm
Contour or hyperplane
Incremental (Stochastic) Gradient Descent
2
Single Perceptron for Regression
Real Valued Output
3
Linear Regression Example
● Suppose that we wish to estimate the prices of houses (in dollars)
based on their area (in square feet) and age (in years).
● The linearity assumption just says that the target (price) can be
expressed as a weighted sum of the features (area and age):
price = w_area * area + w_age * age + b
● w_area and w_age are called weights, and b is called the bias.
● The weights determine the influence of each feature on our prediction
and the bias just says what value the predicted price should take
when all of the features take value 0.
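A one-line sketch of this weighted sum, with hypothetical weight values (not from the deck):

# price = w_area * area + w_age * age + b, for a 1000 sq-ft, 10-year-old house
w_area, w_age, b = 90.0, -1000.0, 50000.0
price = w_area * 1000.0 + w_age * 10.0 + b
print(price)  # 130000.0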
4
Data
● Data
○ The dataset is called a training dataset or training set.
○ Each row is called an example (or data point, data instance, sample).
○ The thing we are trying to predict is called a label (or target).
○ The independent variables upon which the predictions are based are called
features (or covariates).
5
Affine transformations and Linear Models
● A model of the form ŷ = w⊤x + b is an affine transformation of the input features: a linear transformation via the weighted sum, combined with a translation via the bias.
6
Loss Function
● Loss function is a quality measure for some given model or a measure
of fitness.
● The loss function quantifies the distance between the real and
predicted value of the target.
● The loss will usually be a non-negative number where smaller values
are better and perfect predictions incur a loss of 0.
● The most popular loss function in regression problems is the squared
error.
7
Squared Error Loss Function
● The most popular loss function in regression problems is the squared
error.
● For each example i, the squared error is
l⁽ⁱ⁾(w, b) = ½ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²
● For the entire dataset of n examples,
○ average (or equivalently, sum) the losses: L(w, b) = (1/n) Σᵢ l⁽ⁱ⁾(w, b)
● When training the model, find parameters (w*, b*) that minimize the total loss across all training examples.
8
Minibatch Stochastic Gradient Descent (SGD)
● Apply the gradient descent algorithm on a random minibatch of examples every time we need to compute the update.
● In each iteration (a minimal sketch follows this list),
○ Step 1: randomly sample a minibatch B consisting of a fixed number of training examples.
○ Step 2: compute the derivative (gradient) of the average loss on the minibatch with respect to the model parameters.
○ Step 3: multiply the gradient by a predetermined positive value η and subtract the resulting term from the current parameter values.
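A minimal numpy sketch of one such iteration for linear regression with squared-error loss (names and shapes are illustrative, not from the deck):

import numpy as np

def minibatch_sgd_step(X, y, w, b, eta, batch_size):
    idx = np.random.choice(len(X), batch_size, replace=False)  # Step 1: sample B
    Xb, yb = X[idx], y[idx]
    err = Xb @ w + b - yb                  # prediction error on the minibatch
    grad_w = Xb.T @ err / batch_size       # Step 2: gradient of the average loss
    grad_b = err.mean()
    w = w - eta * grad_w                   # Step 3: step against the gradient
    b = b - eta * grad_b
    return w, b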
9
Training using SGD Algorithm
PS: The number of epochs and the learning rate are both hyperparameters. Setting
hyperparameters requires some adjustment by trial and error.
10
Prediction
● Estimating targets given features is commonly called prediction or
inference.
● Given the learned model, values of target can be predicted, for any
set of features.
11
Single-Layer Neural Network
Linear regression is a single-layer neural network
12
Multiple Perceptrons for Classification
Binary Outputs
13
Classification Example
● Each input consists of a 2 × 2 grayscale image.
● Represent each pixel value with a single scalar, giving four features
x1 , x2, x3 , x4.
● Assume that each image belongs to one among the categories
“square”, “triangle”, and “circle”.
● How to represent the labels?
○ Use label encoding. y ∈ {1, 2, 3}, where the integers represent {circle, square,
triangle} respectively.
○ Use one-hot encoding. y ∈ {(1, 0, 0), (0, 1, 0), (0, 0, 1)}.
■ y would be a three-dimensional vector, with (1, 0, 0) corresponding to “circle”,
(0, 1, 0) to “square”, and (0, 0, 1) to “triangle”.
14
Network Architecture
● A model with multiple outputs, one per class. Each output will
correspond to its own affine function.
○ 4 features and 3 possible output categories
15
Network Architecture
○ 12 scalars to represent the weights and 3 scalars to represent the biases
○ compute three logits, o1, o2, and o3, for each input
○ the weights form a 3×4 matrix and the biases a 3-dimensional vector
16
Softmax Operation
● Interpret the outputs of our model as probabilities.
○ Any output ŷj is interpreted as the probability that a given item belongs to class j. Then choose the class with the largest output value as our prediction, argmax_j ŷ_j.
○ If ŷ1, ŷ2, and ŷ3 are 0.1, 0.8, and 0.1, respectively, then predict category 2.
○ To interpret the outputs as probabilities, we must guarantee that, they will be
nonnegative and sum up to 1.
● The softmax function transforms the outputs such that they become
nonnegative and sum to 1, while requiring that the model remains
differentiable.
○ first exponentiate each logit (ensuring non-negativity) and then divide by their sum (ensuring that they sum to 1):
ŷ_j = exp(o_j) / Σ_k exp(o_k)
● Softmax is a nonlinear function.
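A minimal numpy sketch of the softmax operation:

import numpy as np

def softmax(o):
    o = o - o.max()              # subtract the max logit for numerical stability
    exp_o = np.exp(o)            # exponentiate: guarantees non-negativity
    return exp_o / exp_o.sum()   # normalize: guarantees the outputs sum to 1

print(softmax(np.array([2.0, 0.5, 0.1])))  # a valid probability vector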
17
Log-Likelihood Loss Function / Cross-Entropy loss
● The softmax function gives us a vector ŷ, which we can interpret as
estimated conditional probabilities of each class given any input x,
○ ŷ1 = P (y = cat | x).
● Compare the estimates with reality by checking how probable the actual classes are according to our model, given the features. For a one-hot label y, this gives the cross-entropy loss
l(y, ŷ) = − Σ_j y_j log ŷ_j
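A minimal numpy sketch of this loss for one example (values are illustrative):

import numpy as np

def cross_entropy(y_hat, y):
    # negative log-probability assigned to the true class (y is one-hot)
    return -np.sum(y * np.log(y_hat))

y_hat = np.array([0.1, 0.8, 0.1])   # softmax outputs
y = np.array([0.0, 1.0, 0.0])       # one-hot label for category 2
print(cross_entropy(y_hat, y))      # -log(0.8) ≈ 0.223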
18
Multi Layered Perceptrons (MLP)
19
Multilayer Perceptron
● With deep neural networks, use the data to jointly learn both a
representation via hidden layers and a linear predictor that acts upon
that representation.
● Add many hidden layers by stacking many fully-connected layers on
top of each other. Each layer feeds into the layer above it, until we
generate outputs.
● The first (L−1) layers learn the representation and the final layer is the linear predictor. This architecture is commonly called a multilayer perceptron (MLP).
20
MLP Architecture
○ MLP has 4 inputs, 3 outputs, and its hidden layer contains 5 hidden units.
○ Number of layers in this MLP is 2.
○ The layers are both fully connected. Every input influences every neuron in the
hidden layer, and each of these in turn influences every neuron in the output layer.
○ The outputs of the hidden layer are called hidden representations, hidden-layer variables, or hidden variables.
21
Nonlinearity in MLP
22
Activation Functions
23
Activation function
● The activation function of a neuron defines the output of that neuron given an input or set of inputs.
● Activation functions decide whether a neuron should be activated or not by calculating the weighted sum and adding the bias to it.
● They are differentiable operators that transform input signals to outputs, and most of them add non-linearity.
● Artificial neural networks are designed as universal function approximators; to serve that role, they must be able to compute and learn nonlinear functions.
24
1. Step Function
25
2. Sigmoid (Logistic) Activation Function
It is called a squashing function: it squashes any input in the range (−∞, ∞) to some value in the range (0, 1):
σ(x) = 1 / (1 + e^(−x))
26
3. Tanh (hyperbolic tangent) Activation Function
● tanh has a wider output range, (−1, 1), than the sigmoid, which can make learning faster.
● The tanh activation usually works better than the sigmoid activation function for hidden units because the mean of its output is closer to zero, so it centers the data better for the next layer.
● Issues with tanh
○ computationally expensive
○ can lead to vanishing gradients
27
4. ReLU Activation Function
● If we combine two ReLU units, we can recover a piecewise linear approximation of the sigmoid function.
● Some ReLU variants: Softplus (smooth ReLU), Noisy ReLU, Leaky ReLU, Parametric ReLU, and Exponential Linear Unit (ELU). (Minimal sketches follow.)
● Advantages
○ Fast learning and efficient computation
○ Fewer vanishing gradient problems
○ Sparse activation
○ Scale invariant (max operation)
● Disadvantages
○ Can lead to exploding gradients.
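Minimal numpy sketches of the activation functions discussed in this section:

import numpy as np

def step(x):
    return np.where(x >= 0, 1.0, 0.0)        # hard threshold

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                        # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)                # max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)     # small slope for negative inputs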
28
Comparing Activation Functions
29
Training MLP
30
Two Layer Neural Network
31
Compute the Activations
32
Vectoring Forward Propagation
33
Neural Network Training – Forward Pass
34
Neural Network Training – Forward Pass
35
Neural Network Training – Forward Pass
36
Forward Propagation Algorithm
47
Computation Graph for Forward Pass
48
Cost Function
49
Neural Network Training – Backward Pass
50
Neural Network Training – Backward Pass
51
Neural Network Training – Backward Pass
52
Computing the Gradients
76
Backpropagation Algorithm to compute Gradients
77
Computation Graph for BackProp
78
Neural Network Training – Update Parameters
79
Parameter Updation
80
Training of MultiLayer Perceptron (MLP)
Requires
1. Forward pass through each layer to compute the output.
2. Compute the deviation or error between the desired output and the output computed in the forward pass (first step). This becomes the objective function, since we want to minimize this deviation or error.
3. The deviation has to be sent back through each layer to compute the delta, or change, in the parameter values. This is achieved using the backpropagation algorithm.
4. Update the parameters. (A minimal sketch of these four steps follows.)
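A minimal numpy sketch of these four steps for a two-layer MLP (sigmoid hidden layer, linear output, mean-squared-error objective); shapes and names are illustrative, not from the deck:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, Y, W1, b1, W2, b2, eta):
    # 1. Forward pass through each layer
    H = sigmoid(X @ W1 + b1)
    Y_hat = H @ W2 + b2
    # 2. Deviation between desired and computed output (MSE objective)
    dY = (Y_hat - Y) / len(X)
    # 3. Backpropagate the deviation to get the parameter deltas
    dW2 = H.T @ dY
    db2 = dY.sum(axis=0)
    dZ1 = (dY @ W2.T) * H * (1 - H)   # sigmoid'(z) = h * (1 - h)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    # 4. Update the parameters
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2
    return W1, b1, W2, b2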
81
82
Scaling up for L layers in MLP
83
Forward Propagation Algorithm
84
Backward Propagation Algorithm
85
Update the Parameters
86
Ref:
Chapters 3 and 4 of T1
87
Next Session:
Power of MLP
88
Deep Learning
DSE Module 2
Seetha Parameswaran
BITS Pilani
The author of this deck, Prof. Seetha Parameswaran, gratefully acknowledges the authors who made their course materials freely available online.
2
Complex Boundaries Using MLP
3
Composing Complex Decision Boundaries
4
Boolean over the Reals
5
Boolean over the Reals
6
Boolean over the Reals
7
Boolean over the Reals
8
Boolean over the Reals
9
Boolean over the Reals
10
More Complex Boundaries
11
More Complex Boundaries
12
Example
13
Question
(x1, x2) are input features and target classes are
either +1 or -1 as shown in the figure.
A. What is the minimum number of hidden layers and
hidden nodes required to classify the following
dataset with 100% accuracy using a fully connected
multilayer perceptron network? Step activation
functions are used at all nodes, i.e., output=+1 if total
weighted input >= bias b at a node, else output = -1.
B. Show the minimal network architecture by
organizing the nodes in each layer horizontally. Show
the node representing x1 at the left on the input layer.
Organize the hidden nodes in ascending order of the bias at that node. Specify all weights and bias values at all nodes. Weights can only be −2.5, 2.5, or 0, and biases positive or negative multiples of 2.5.
14
Solution
A. 2 hidden layers are needed, with 4 nodes in the first hidden layer and 2 nodes in the second hidden layer.
B.
15
MLP as Universal Boolean Functions
16
Multi Layered Perceptrons for Boolean Functions
17
How many layers for a Boolean MLP?
18
How many layers for a Boolean MLP?
19
How many layers for a Boolean MLP?
20
How many layers for a Boolean MLP?
21
How many layers for a Boolean MLP?
22
How many layers for a Boolean MLP?
23
How many layers for a Boolean MLP?
24
How many layers for a Boolean MLP?
25
How many layers for a Boolean MLP?
26
How many layers for a Boolean MLP?
27
Reducing a Boolean Function
28
Reducing a Boolean Function
29
Reducing a Boolean Function
30
Largest irreducible DNF?
31
Largest irreducible DNF?
32
Width of a Single Layer Boolean MLP
33
Width of a Single Layer Boolean MLP
34
Width of a Single Layer Boolean MLP
35
Width of deep MLP
36
MLP XOR
37
Width of deep MLP
38
Width of deep MLP
39
Width of deep MLP
40
Width of Single Layer Boolean MLP
41
A better representation…
42
Challenge of Depth
43
Need of Depth
44
Network Size
45
46
Question
How many perceptrons are required to represent W ⊕ X ⊕ Y ⊕ Z ?
47
Solution
48
Solution
49
Computational Graph
50
Computational Graph: Example
51
Computational Graph for Logistic Regression
52
Computational Graph for Back Propagation
53
Computational Graph for Back Propagation
54
Computational Graph for Back Propagation
55
Computational Graph for Back Propagation
56
Computational Graph for Back Propagation
57
Computational Graph for Back Propagation
58
Computational Graph for Back Propagation
59
Computational Graph for Back Propagation
60
Computational Graph for Back Propagation
61
Computational Graph for Back Propagation
62
Computational Graph for Back Propagation
63
Computational Graph for Back Propagation
64
Computational Graph for Back Propagation
65
Computational Graph for Back Propagation
66
Computational Graph for Back Propagation
67
Computational Graph for Back Propagation
68
Computational Graph for Back Propagation
69
Question 8
70
Solution for Q8
71
Solution for Q8
72
Question 7
Draw the computational graph for the sigmoid function and show the gradient computation as well. Use generic equations. (A minimal sketch follows.)
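A minimal Python sketch of one possible decomposition (node names are illustrative):

import math

def sigmoid_graph(z):
    # forward: z -> -z -> exp(-z) -> 1 + exp(-z) -> 1 / (1 + exp(-z))
    a = -z
    b = math.exp(a)
    c = 1.0 + b
    y = 1.0 / c
    # backward: multiply local gradients along the graph (chain rule)
    dy_dc = -1.0 / c ** 2
    dc_db = 1.0
    db_da = math.exp(a)
    da_dz = -1.0
    dy_dz = dy_dc * dc_db * db_da * da_dz
    return y, dy_dz   # dy_dz equals y * (1 - y)

y, g = sigmoid_graph(0.5)
print(y, g, y * (1 - y))  # the two gradient expressions match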
73
Demo of DNN
1. XOR Implementation
https://colab.research.google.com/drive/1xVVpeU3q4bIOexV0J3NhYbLVCHwaOl6R#scrollTo=GRaiuHtKI1Sq
74
Example with ReLU
75
Example with ReLU
76
Question
77
Solution
78
Question
79
80
Ref:
● http://mlsp.cs.cmu.edu/people/rsingh/docs/Chapter1_Introduction.pdf
● http://mlsp.cs.cmu.edu/people/rsingh/docs/Chapter2_UniversalApproximators.pdf
81
Next Session:
Mod3: Optimization
Refresh: Calculus
82
Deep Neural Network
AIML Module 3
Seetha Parameswaran
BITS Pilani
1
The author of this deck, Prof. Seetha Parameswaran, gratefully acknowledges the authors who made their course materials freely available online.
2
Optimization
3
What we Learn….
3.1 Challenges in Neural Network Optimization – saddle points and plateaus
3.2 Non-convex optimization intuition
3.3 Overview of optimization algorithms
3.4 Momentum based algorithms
3.5 Algorithms with Adaptive Learning Rates
4
Optimization Algorithms
5
Optimization Algorithm
● Optimization algorithms train deep learning models.
● Optimization algorithms are the tools that allow us to
○ keep updating the model parameters, and
○ minimize the value of the loss function, as evaluated on the training set.
● In optimization, a loss function is often referred to as the objective function of the optimization problem.
● By tradition and convention, most optimization algorithms are concerned with minimization.
6
Optimization
7
Why Optimization Algorithm?
● The performance of the optimization algorithm directly affects the
modelʼs training efficiency.
● Understanding the principles of different optimization algorithms and the
role of their hyperparameters will enable us to tune the
hyperparameters in a targeted manner to improve the performance of
deep learning models.
● The goal of optimization is to reduce the training error. The goal of deep learning is to reduce the generalization error, which also requires reducing overfitting.
8
Optimization Challenges in Deep Learning
● Local minima
● Saddle points
● Vanishing gradients
9
Local Minima
● For any objective function f (x), if
the value of f (x) at x is smaller
than the values of f (x) at any other
points in the vicinity of x, then f (x)
could be a local minimum.
● If the value of f (x) at x is the
minimum of the objective function
over the entire domain, then f (x) is
the global minimum.
● In minibatch stochastic gradient
descent, the natural variation of
gradients over minibatches is able
to dislodge the parameters from
local minima.
10
Finding Minimum of a Function
11
Finding Minimum of a Function
12
Derivatives of a Function
13
Saddle points
● A saddle point is any location where
all gradients of a function vanish
but which is neither a global nor a
local minimum.
● E.g., f(x, y) = x² − y²
○ saddle point at (0, 0)
○ maximum w.r.t. y
○ minimum w.r.t. x
14
Derivatives at Saddle (Inflection) Point
15
Derivatives at Saddle (Inflection) Point
16
Functions of Multiple Variables
17
Gradient of a Scalar Function
18
Gradient of a Scalar Function with multiple variables
19
Hessian
20
Gradient of a Scalar Function with multiple variables
21
Solution of Unconstrained Minimization
1. Solve for x where the gradient equals zero:
∇f(x) = 0
2. Compute the Hessian matrix ∇²f(x) at the candidate solution and verify that
○ Local minimum
■ eigenvalues of the Hessian matrix are all positive
○ Local maximum
■ eigenvalues of the Hessian matrix are all negative
○ Saddle point
■ eigenvalues of the Hessian matrix at the zero-gradient position have mixed signs (some negative, some positive)
(A minimal sketch follows.)
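A minimal numpy sketch of this classification rule for f(x, y) = x² − y², whose gradient vanishes at (0, 0):

import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])          # Hessian of x^2 - y^2
eig = np.linalg.eigvalsh(H)          # eigenvalues of the symmetric Hessian
if np.all(eig > 0):
    print("local minimum")
elif np.all(eig < 0):
    print("local maximum")
else:
    print("saddle point")            # mixed signs -> saddle (this case)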
22
Example
23
Example
24
Example
25
Vanishing gradients
● Function f(x) = tanh(x)
● f′(x) = 1 − tanh²(x)
○ f′(4) ≈ 0.0013
● The gradient of f is close to zero for large inputs.
● Vanishing gradients can
cause optimization to stall.
○ Reparameterization of the problem helps.
○ So does good initialization of the parameters.
26
Gradient Descent
27
How to find Global Minima?
28
Find Global Minima Iteratively
29
Approach of Gradient Descent
30
Approach of Gradient Descent
31
Approach of Gradient Descent
def gd_step(old, eta, grad):
    # one gradient-descent update: move against the gradient
    return old - eta * grad
32
Approach of Gradient Descent
● First-order gradient descent algorithms use the first-order derivatives to get the value (magnitude) and direction of the update.
34
Learning Rate
● The role of the learning rate is to moderate the degree to which
weights are changed at each step.
● Learning rate η is set by the algorithm designer.
● If the learning rate is too small, x updates very slowly, requiring more iterations to get a better solution.
● If the learning rate is too large, the solution oscillates and, in the worst case, it might even diverge.
35
Learning rate and Gradient Descent
37
Stochastic Gradient Descent
● In deep learning, the objective function is the average of the loss functions
for each example in the training dataset.
● Given a training dataset of n examples, let f_i(x) be the loss function with respect to the training example of index i, where x is the parameter vector.
● The objective function is f(x) = (1/n) Σᵢ f_i(x).
● Update x as x ← x − η ∇f_i(x), where the index i is sampled uniformly at random.
39
Dynamic Learning Rate
40
Dynamic Learning Rate
● Replace η with a time-dependent learning rate η(t)
○ adds to the complexity of controlling convergence of an optimization algorithm.
● A few basic strategies that adjust η over time.
1. Piecewise constant
a. Decrease the learning rate, e.g., whenever progress in optimization stalls.
b. This is a common strategy for training deep networks.
2. Exponential decay: η(t) = η₀ · e^(−λt)
a. Can lead to premature stopping before the algorithm has converged.
3. Polynomial decay: η(t) = η₀ · (βt + 1)^(−α), commonly with α = 0.5.
41
Exponential Decay
import math

t = 0
def exponential_lr():
    # eta(t) = exp(-0.1 * t), advanced once per call
    global t
    t += 1
    return math.exp(-0.1 * t)
43
Polynomial decay
t = 0
def polynomial_lr():
    # eta(t) = (1 + 0.1 * t) ** (-0.5), advanced once per call
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)
44
Review
● Gradient descent
○ Uses the full dataset to compute gradients and to update parameters, one pass at a time.
○ Gradient descent is not particularly data efficient whenever data is very similar.
● Stochastic gradient descent
○ Processes one observation at a time to make progress.
○ Stochastic gradient descent is not particularly computationally efficient, since CPUs and GPUs cannot exploit the full power of vectorization.
○ For noisy gradients, the choice of the learning rate is critical.
■ If we decrease it too rapidly, convergence stalls.
■ If we are too lenient, we fail to converge to a good enough solution, since noise keeps driving us away from optimality.
● Minibatch SGD
○ Accelerates computation, i.e., offers better computational efficiency.
○ Averaging gradients reduces the amount of variance.
45
Minibatch Stochastic Gradient Descent
46
Minibatch Stochastic Gradient Descent
● In each iteration, we first randomly sample a minibatch B consisting of
a fixed number of training examples.
● We then compute the derivative (gradient) of the average loss on the
minibatch with regard to the model parameters.
● Finally, we multiply the gradient by a predetermined positive value η
and subtract the resulting term from the current parameter values.
47
Minibatch Stochastic Gradient Descent Algorithm
● Gradients at time t is calculated as
48
SGD Algorithm
49
Momentum
50
Leaky Average in Minibatch SGD
● Replace the gradient computation by a “leaky average” for better variance reduction, with β ∈ (0, 1):
v_t = β v_{t−1} + g_t
● The update then moves along the velocity v_t instead of the raw gradient. (A minimal sketch follows.)
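A minimal sketch of momentum on the ellipsoid example of the next slide, f(x1, x2) = 0.1·x1² + 2·x2² (starting point and iteration count are illustrative):

eta, beta = 0.6, 0.5
x1, x2, v1, v2 = -5.0, -2.0, 0.0, 0.0
for _ in range(20):
    g1, g2 = 0.2 * x1, 4 * x2    # gradients of f
    v1 = beta * v1 + g1          # leaky average of past gradients
    v2 = beta * v2 + g2
    x1 -= eta * v1               # step along the velocity, not the raw gradient
    x2 -= eta * v2
print(x1, x2)                    # heads toward the minimum at (0, 0)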
52
Momentum Method Example
● Consider a moderately distorted ellipsoid objective f(x) = 0.1 x₁² + 2 x₂².
● f has its minimum at (0, 0). This function is very flat in the x₁ direction.
● For η = 0.4, without momentum: the gradient in the x₂ direction oscillates much more than in the horizontal x₁ direction.
● For η = 0.6, without momentum: convergence in the x₁ direction improves, but the overall solution quality diverges.
53
Momentum Method Example
● Consider the same moderately distorted ellipsoid objective f(x) = 0.1 x₁² + 2 x₂².
● Apply momentum for η = 0.6.
55
Adagrad
56
Adagrad
● Used for features that occur infrequently (sparse features)
● Adagrad uses an aggregate of the squares of previously observed gradients.
57
Adagrad Algorithm
● A state variable s_t accumulates past squared gradients, giving a per-coordinate learning rate:
s_t = s_{t−1} + g_t²,  x_t = x_{t−1} − (η / √(s_t + ε)) ⊙ g_t
(A minimal sketch follows.)
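A minimal sketch of Adagrad on the same ellipsoid objective f(x1, x2) = 0.1·x1² + 2·x2² (settings are illustrative):

import math

eta, eps = 0.4, 1e-6
x1, x2, s1, s2 = -5.0, -2.0, 0.0, 0.0
for _ in range(20):
    g1, g2 = 0.2 * x1, 4 * x2
    s1 += g1 ** 2                         # accumulate squared gradients
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1  # per-coordinate learning rate
    x2 -= eta / math.sqrt(s2 + eps) * g2
print(x1, x2)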
58
Adagrad: Summary
● Adagrad decreases the learning rate dynamically on a per-coordinate
basis.
● It uses the magnitude of the gradient as a means of adjusting how quickly progress is achieved: coordinates with large gradients are compensated with a smaller learning rate.
● If the optimization problem has a rather uneven structure Adagrad can
help mitigate the distortion.
● Adagrad is particularly effective for sparse features where the learning
rate needs to decrease more slowly for infrequently occurring terms.
● On deep learning problems Adagrad can sometimes be too
aggressive in reducing learning rates.
59
RMSProp
60
RMSProp
● Adagrad uses a learning rate that decreases at a predefined schedule of effectively O(t^(−1/2)).
● The RMSProp algorithm decouples rate scheduling from coordinate-adaptive learning rates. This is essential for non-convex optimization.
61
RMSProp Algorithm
● Use a leaky average to accumulate past gradient variance:
s_t = γ s_{t−1} + (1 − γ) g_t²
62
RMSProp Algorithm
63
RMSProp Example
import math

eta, gamma = 0.4, 0.9  # illustrative settings; the deck's values are not shown
def rmsprop_2d(x1, x2, s1, s2):
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2
64
RMSProp: Summary
● RMSProp is very similar to Adagrad as both use the square of the
gradient to scale coefficients.
● RMSProp shares with momentum the leaky averaging. However,
RMSProp uses the technique to adjust the coefficient-wise
preconditioner.
● The learning rate needs to be scheduled by the experimenter in
practice.
● The coefficient γ (gamma) determines how long the history is when
adjusting the per-coordinate scale.
65
Adam
66
Review of techniques learned so far
1. Stochastic gradient descent
○ more effective than Gradient Descent when solving optimization problems, e.g.,
due to its inherent resilience to redundant data.
2. Minibatch Stochastic gradient descent
○ affords significant additional efficiency arising from vectorization, using larger sets
of observations in one minibatch. This is the key to efficient multi-machine,
multi-GPU and overall parallel processing.
3. Momentum
○ added a mechanism for aggregating a history of past gradients to accelerate
convergence.
4. Adagrad
○ used per-coordinate scaling to allow for a computationally efficient preconditioner.
5. RMSProp
○ decoupled per-coordinate scaling from a learning rate adjustment.
67
Adam
● Adam combines all these techniques into one efficient learning algorithm.
● It is one of the more robust and effective optimization algorithms used in deep learning.
● Adam can diverge due to poor variance control (a disadvantage).
● Adam uses exponentially weighted moving averages (also known as leaky averaging) to obtain estimates of both the momentum and the second moment of the gradient.
68
Adam Algorithm
● State variables (leaky averages of the gradient and its square), with β₁, β₂ ∈ (0, 1):
v_t = β₁ v_{t−1} + (1 − β₁) g_t,  s_t = β₂ s_{t−1} + (1 − β₂) g_t²
69
Adam Algorithm
● Rescale the gradient using the bias-corrected estimates v̂_t = v_t / (1 − β₁ᵗ) and ŝ_t = s_t / (1 − β₂ᵗ):
g_t′ = η v̂_t / (√ŝ_t + ε)
● Compute updates: x_t = x_{t−1} − g_t′. (A minimal sketch follows.)
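A minimal sketch of Adam for a single scalar parameter, reusing the earlier worked objective f(x) = (x + 5)²; the hyperparameters are common defaults, not values from the deck:

import math

beta1, beta2, eta, eps = 0.9, 0.999, 0.1, 1e-6
x, v, s = 3.0, 0.0, 0.0
for t in range(1, 201):
    g = 2 * (x + 5)                       # gradient of f
    v = beta1 * v + (1 - beta1) * g       # leaky average: momentum
    s = beta2 * s + (1 - beta2) * g ** 2  # leaky average: second moment
    v_hat = v / (1 - beta1 ** t)          # bias correction for slow startup
    s_hat = s / (1 - beta2 ** t)
    x -= eta * v_hat / (math.sqrt(s_hat) + eps)
print(x)  # approaches the minimum at x = -5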
70
Adam Algorithm
71
Adam: Summary
● Adam combines features of many optimization algorithms into a fairly
robust update rule.
● Adam uses bias correction to adjust for a slow startup when
estimating momentum and a second moment.
● For gradients with significant variance, we may encounter issues with convergence. These can be amended by using larger minibatches or by switching to an improved estimate for the state variables. The Yogi algorithm offers such an alternative.
72
Learning Rate: Summary
1. Adjusting the learning rate is often just as important as the actual
algorithm.
2. The magnitude of the learning rate matters. If it is too large, optimization diverges; if it is too small, it takes too long to train or we end up with a suboptimal result. The momentum algorithm helps.
3. The rate of decay is just as important. If the learning rate remains large, we may simply end up bouncing around the minimum and thus not reach optimality. We want the rate to decay, but probably more slowly than O(t^(−1/2)).
4. Initialization pertains both to how the parameters are set initially and
also how they evolve initially. This is known as warmup, i.e., how
rapidly we start moving towards the solution initially.
73
Numerical Problems
74
Question with Solution
75
Question
1. Compute the values of (w1, w2) that minimize the error. Compute the minimum possible value of the error.
2. What will be the value of (w1, w2) at time (t + 1) if standard gradient descent is used?
3. What will be the value of (w1, w2) at time (t + 1) if momentum is used?
4. What will be the value of (w1, w2) at time (t + 1) if RMSProp is used?
5. What will be the value of (w1, w2) at time (t + 1) if Adam is used?
76
Solution
77
Solution
78
Solution
79
Ref TB Dive into Deep Learning
● Chapter 12 (online version)
80
Next Session:
Regularization
81
Deep Neural Network
AIML Module 5
Seetha Parameswaran
BITS Pilani
1
The author of this deck, Prof. Seetha Parameswaran, gratefully acknowledges the authors who made their course materials freely available online.
2
Regularization Techniques
3
What we Learn….
4.1 Model Selection
4.2 Underfitting, and Overfitting
4.3 L1 and L2 Regularization
4.4 Dropout
4.5 Challenges - Vanishing and Exploding Gradients, Covariate shift
4.6 Parameter Initialization
4.7 Batch Normalization
4
Generalization in DNN
5
Generalization
● Goal is to discover patterns that generalize.
○ The goal is to discover patterns that capture regularities in the
underlying population from which our training set was drawn.
○ Models are trained on a sample of data.
○ When working with finite samples, we run the risk that we might
discover apparent associations that turn out not to hold up when we
collect more data or on newer samples.
● The trained model should also predict well for newer or unseen data. This problem is called generalization.
6
Training Error and Generalization Error
● Training error is the error of our model as calculated on the training
dataset.
○ Obtained while training the model.
● Generalization error is the expectation of our modelʼs error, if an
infinite stream of additional data examples drawn from the same
underlying data distribution as the original sample were applied on the
model.
○ Cannot be computed, but estimated.
○ Estimate the generalization error by applying the model to an independent test set,
constituted of a random selection of data examples that were withheld from the
training set.
7
Model Complexity
● Simple models and abundant data
○ Expect the generalization error to resemble the training error.
● More complex models and fewer examples
○ Expect the training error to go down but the generalization gap to grow.
● Model complexity
○ A model with more parameters might be considered more complex.
○ A model whose parameters can take a wider range of values might be more
complex.
○ A neural network model that takes more training iterations is more complex, and one subject to early stopping (fewer training iterations) is less complex.
8
Factors that influence the generalizability of a model
1. The number of tunable parameters.
○ When the number of tunable parameters, called the degrees of freedom, is large,
models tend to be more susceptible to overfitting.
2. The values taken by the parameters.
○ When weights can take a wider range of values, models can be more susceptible
to overfitting.
3. The number of training examples.
○ It is trivially easy to overfit a dataset containing only one or two examples even if
your model is simple. But overfitting a dataset with millions of examples requires
an extremely flexible model.
9
Model Selection
● Model selection is the process of selecting the final model after
evaluating several candidate models.
● With MLPs, compare models with
○ different numbers of hidden layers,
○ different numbers of hidden units
○ different activation functions applied to each hidden layer.
● Use Validation dataset to determine the best among our candidate
models.
10
Validation dataset
● Never rely on the test data for model selection.
○ Risk of overfitting the test data
● Do not rely solely on the training data for model selection
○ We cannot estimate the generalization error on the very data that we use to train
the model.
● Split the data three ways, incorporating a validation dataset (or
validation set) in addition to the training and test datasets.
● In deep learning, with millions of examples available, the split is generally
○ Training = 98-99 % of the original dataset
○ Validation = 1-2 % of training dataset
○ Testing = 1-2 % of the original dataset
11
Just Right Model
● High Training accuracy
● High Validation accuracy
● Low Bias and Low Variance
● Usually care more about the
validation error than about the gap
between the training and validation
errors.
12
Underfitting
● Low Training accuracy and Low Validation accuracy.
● Training error and validation error are both substantial, but there is only a small gap between them.
● The model is too simple (insufficiently expressive) to capture the pattern that we
are trying to model.
● If generalization gap between our training and validation errors is small, a more
complex model may be better.
13
Overfitting
● The phenomenon of fitting the training data more closely than we fit the underlying distribution is called overfitting.
● High Training accuracy and Low
Validation accuracy
● Training error is significantly lower
than the validation error.
● The techniques used to combat
overfitting are called
regularization.
14
Underfitting or Overfitting?
15
Polynomial degree and underfitting vs. overfitting
16
Model complexity and dataset size
● With more data, we can fit a more complex model.
● With more data, the generalization error typically decreases.
17
Deep Learning Model Selection
18
Regularization
19
Regularization Techniques
● Weight Decay ( L2 regularization)
● Dropout
● Early Stopping
20
Weight Decay
21
L2 Regularization
● Measure the complexity of a linear function f(x) = w⊤x by some norm of its weight vector, e.g., ∥w∥².
● Add the norm as a penalty term to the problem of minimizing the loss.
This will ensure that the weight vector is small.
● The objective function becomes minimizing the sum of the
prediction loss and the penalty term.
● L2-regularized linear models constitute the ridge regression algorithm.
22
L2 Regularization
● The trade-off between the standard loss and the additive penalty is given by the regularization constant λ, a non-negative hyperparameter:
L(w, b) + (λ/2) ∥w∥²
24
● L2 regularization: the penalty is the sum of squares of the weights.
● L1 regularization: the penalty is the sum of absolute values of the weights.
(A Keras sketch follows.)
25
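A minimal Keras sketch, consistent with the Keras models used later in this deck; λ = 0.01 is an illustrative value:

from tensorflow.keras import layers, regularizers

# L2 (weight decay / ridge) penalty on a layer's weights
dense_l2 = layers.Dense(64, activation='relu',
                        kernel_regularizer=regularizers.l2(0.01))
# L1 (lasso-style, sparsity-inducing) penalty on a layer's weights
dense_l1 = layers.Dense(64, activation='relu',
                        kernel_regularizer=regularizers.l1(0.01))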
Dropout
26
Smoothness
● Classical generalization theory suggests that to close the gap
between train and test performance, aim for a simple model.
● Simplicity can be achieved
○ using weight decay
○ through smoothness, i.e., the function should not be sensitive to small changes to its inputs.
● Injecting noise enforces smoothness
○ training with input noise
○ inject noise into each layer of the network before calculating the subsequent layer
during training.
27
Dropout
● Dropout involves injecting noise while computing each internal layer
during forward propagation.
● It has become a standard technique for training neural networks.
● The method is called dropout because we literally drop out some
neurons during training.
● Apply dropout to a hidden layer, zeroing out each hidden unit with
probability p.
● The calculation of the outputs no longer depends on the dropped-out neurons, and their respective gradients also vanish when performing backpropagation (in that iteration).
28
Without Dropout
29
Dropout for first iteration
30
Dropout for second iteration
31
Dropout
● Dropout gives a smaller effective neural network in each iteration, producing a regularization effect.
● In general,
○ vary the keep probability (0.5 to 0.8) for each hidden layer.
○ the input layer has a keep probability of 1.0 or 0.9.
○ the output layer has a keep probability of 1.0.
(A minimal sketch follows.)
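A minimal numpy sketch of (inverted) dropout, the common implementation of this idea, applied to a hidden activation during training:

import numpy as np

def dropout_layer(H, keep_prob):
    mask = (np.random.rand(*H.shape) < keep_prob).astype(H.dtype)
    return H * mask / keep_prob   # rescale so the expected activation is unchanged

H = np.random.randn(4, 5)
print(dropout_layer(H, 0.8))      # roughly 20% of the units are zeroed out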
32
Early Stopping
33
Without Early Stopping
● When training large models, training error decreases steadily over time, but validation set error eventually begins to rise again.
● The training objective decreases consistently over time.
● The validation set average loss begins to increase again, forming an asymmetric U-shaped curve.
34
Early stopping
● While training, we are no longer looking for a local minimum of the validation error.
● Train until the validation set error has not improved for some amount
of time.
● Every time the error on the validation set improves, store a copy of the
model parameters. When the training algorithm terminates, return
these parameters.
35
Early stopping
● Effective and simple form of regularization.
● Trains simpler models
36
Early Stopping code
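The slide's code is not reproduced in this text; a minimal Keras sketch of the scheme described above (the model and data variables are assumed to exist):

from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation loss has not improved for `patience` epochs and
# restore the stored best parameters.
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)
model.fit(x_train, y_train, epochs=100,
          validation_data=(x_val, y_val), callbacks=[early_stop])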
37
Numerical Stability and Initialization
38
Why Initialization is important?
● The choice of initialization is crucial for maintaining numerical stability.
● The choices of initialization can be tied up in interesting ways with the
choice of the nonlinear activation function.
● Which function we choose and how we initialize parameters can
determine how quickly our optimization algorithm converges.
● Poor choices can cause us to encounter exploding or vanishing gradients during training.
39
Vanishing and Exploding Gradients
● Consider a deep network with L layers, input x, and output o, with each layer l defined by a transformation f_l parameterized by weights W⁽ˡ⁾, whose hidden variable is h⁽ˡ⁾.
● If all the hidden variables and the input are vectors, then the gradient of o with respect to any set of parameters W⁽ˡ⁾ is a product of the layer-to-layer Jacobians above layer l and a local gradient:
∂o/∂W⁽ˡ⁾ = (∂h⁽ᴸ⁾/∂h⁽ᴸ⁻¹⁾) · … · (∂h⁽ˡ⁺¹⁾/∂h⁽ˡ⁾) · ∂h⁽ˡ⁾/∂W⁽ˡ⁾
A product of many matrices can shrink or grow exponentially with depth, causing vanishing or exploding gradients.
41
Vanishing Gradients
● The sigmoid activation function σ can cause the vanishing gradient problem.
○ The sigmoid's gradient vanishes both when its inputs are large and when they are small.
○ When backpropagating through many layers where the inputs to many of the sigmoids are close to zero, the gradients of the overall product may vanish.
● Solution: use ReLU for hidden layers. ReLU is more stable.
42
Parameter Initialization
1. Default initialization
○ Use a normal distribution to initialize the values of the parameters.
2. Xavier initialization (a minimal sketch follows)
○ samples weights from a Gaussian distribution with zero mean and variance σ² = 2 / (n_in + n_out), where n_in and n_out are the numbers of inputs and outputs of the layer.
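A minimal numpy sketch of Xavier initialization for one layer (layer sizes are illustrative):

import numpy as np

n_in, n_out = 256, 128                       # fan-in and fan-out of the layer
std = np.sqrt(2.0 / (n_in + n_out))          # variance 2 / (n_in + n_out)
W = np.random.normal(0.0, std, size=(n_in, n_out))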
43
Batch Normalization
44
Why Batch Normalization?
1. Standardize the input features to each have a mean of zero and
variance of one. This standardization puts the parameters a priori at a
similar scale. Better optimization.
2. In an MLP or CNN, as we train, the variables in intermediate layers may
take values with widely varying magnitudes: both along the layers
from the input to the output, across units in the same layer, and over
time due to our updates to the model parameters. This drift in the
distribution of such variables could hamper the convergence of the
network.
3. Deeper networks are complex and easily capable of overfitting. This
means that regularization becomes more critical.
45
Batch Normalization
● Batch normalization is a popular and effective technique that
consistently accelerates the convergence of deep networks.
● Batch normalization is applied to individual layers.
● It works as follows:
○ In each training iteration, first normalize the inputs (of batch normalization) by
subtracting their mean and dividing by their standard deviation, where both
are estimated based on the statistics of the current minibatch.
○ Next, apply a scale coefficient and a scale offset.
● It is from this normalization based on batch statistics that batch normalization derives its name.
● Batch normalization works best for moderate minibatches sizes in the
50 to 100 range.
46
Batch Normalization
● Denote by x ∈ B an input to batch normalization (BN) drawn from a minibatch B; batch normalization transforms x as
BN(x) = γ ⊙ (x − μ̂B) / σ̂B + β
● μ̂B is the sample mean and σ̂B is the sample standard deviation of the minibatch B.
● After applying standardization, the resulting minibatch has zero mean and unit variance.
47
Batch Normalization
● The elementwise scale parameter γ and shift parameter β have the same shape as x; γ and β are learned jointly with the other model parameters.
● Batch normalization actively centers and rescales the inputs to each layer back to a given mean and size.
● Calculate μ̂B and σ̂B as
μ̂B = (1/|B|) Σ_{x∈B} x,  σ̂B² = (1/|B|) Σ_{x∈B} (x − μ̂B)² + ε
(A minimal sketch follows.)
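A minimal numpy sketch of the training-time transformation for a fully-connected layer (rows of X are the examples in the minibatch):

import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                     # per-feature minibatch mean
    var = X.var(axis=0)                     # per-feature minibatch variance
    X_hat = (X - mu) / np.sqrt(var + eps)   # standardize: zero mean, unit variance
    return gamma * X_hat + beta             # learned scale and shift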
49
Batch Normalization Layers
● Batch normalization implementations for fully-connected layers and
convolutional layers are slightly different.
○ Fully-Connected Layers
■ Insert batch normalization after the affine transformation and before the
nonlinear activation function.
○ Convolutional Layers
■ Apply batch normalization after the convolution and before the nonlinear
activation function.
■ Carry out each batch normalization over the m ·p ·q elements per output
channel simultaneously.
● It operates on a full minibatch at a time.
50
Batch Normalization During Prediction
● After training, use the entire dataset to compute stable estimates of
the variable statistics and then fix them at prediction time.
51
Numerical Problems
(discuss in Webinar)
52
Ref TB Dive into Deep Learning
● Sections 5.4, 5.5, 5.6 and 8.5 (online version)
53
Next Session:
CNN
54
Deep Learning
DSE Review Session 8
Seetha Parameswaran
BITS Pilani
The author of this deck, Prof. Seetha Parameswaran, gratefully acknowledges the authors who made their course materials freely available online.
2
Midsem Review Questions
3
Question
For a newsfeed classification system with 46 topics, the following two
models are tried. 10,000 unique words are used as input in both models.
20% of the training data is used as the validation set, partial_x_train.
Network Model 1:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(partial_x_train, partial_y_train, epochs=9, batch_size=512, validation_data=(x_val, y_val))

Network Model 2:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=128, validation_data=(x_val, y_val))
4
Solution
What is the number of trainable parameters in the input-hidden layer and the hidden-output layer in Model 2? Which of the networks will give lower validation error, and why?
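A worked solution (reconstructed here, since the slide's content is not in this text; verify against the original deck):
● Model 2 trainable parameters: input to first hidden layer: 10000 × 64 + 64 = 640,064; first to second hidden layer: 64 × 4 + 4 = 260; second hidden layer to output: 4 × 46 + 46 = 230.
● Model 1 should give the lower validation error: Model 2's 4-unit hidden layer is an information bottleneck too narrow to carry the information needed to separate 46 classes, so its validation performance suffers.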
5
Question
Hidden node and output node use, respectively, ReLU and
sigmoid activation functions. Bias values at hidden and output
nodes are zero. Weights for the current iteration are given in the
above figure. Target output d is specified as 0. Learning rate is 0.3.
A. Calculate the actual output y for the current iteration with
input (x1, x2) = (1, 1).
B. Calculate the binary cross-entropy error for the current
iteration.
C. Assuming L1 regularization constant = 0.2, calculate the new
w3 for the next iteration.
D. Assuming L2 regularization constant = 0.2, calculate the w1
for the next iteration.
E. Assuming both L1 (with regularization constant=0.2) and L2
(regularization constant=0.2) are applied, calculate the value of w5
in next iteration.
6
Solution
A. y = 1/(1 + 1) = 1/2
B. With target d = 0, binary cross-entropy L = −log(1 − y) = −log(1/2) = log 2
C. Let z be the total input to the output node.
y = 1/(1+exp(-z))
w3(t+1) = w3(t) - 0.3*dL/dw3 - 0.2*0.3*sign(w3)
= -0.94 - 0.3*dL/dz*dz/dw3
= -0.94 - 0.3*x1*1/2*1/2 =-0.94-0.3 /4
D. Let h1 be the output of the hidden node.
w1(t+1) = w1(t)-0.3*dL/dh1*dh1/dw1-0.3*0.2*w1
=0.94, since dh1/dw1 =ReLU’(0)=0
E. w5(t+1) = w5(t)-0.3*dL/dz*dz/dw5-0.3*0.2*w5-0.3*0.2*sign(w5) = 0.88
7
Question
A. What is a saddle point? What is the advantage/disadvantage of using
Stochastic Gradient Descent in dealing with saddle points?
B. What are five strategies to prevent overfitting in deep networks, when
used for classification, say of images?
C. What is the difference between kernel regularizers and activity
regularizers?
8
Answer
A. In a multi-dimensional surface, points where all the first partial derivatives are 0 but some of the second partial derivatives are positive and some are negative are known as saddle points; in other words, the Hessian is indefinite. At saddle points, the surface has one or more minima along some directions and maxima along other directions. The gradient noise across SGD's minibatches can help dislodge the parameters from saddle points (advantage), at the cost of noisier updates (disadvantage).
B. L1/L2 regularization, dropout, data augmentation, early stopping, adding noise to the input/target output.
C. kernel_regularizer: regularizer that applies a penalty on the layer's kernel (weights). activity_regularizer: regularizer that applies a penalty on the layer's output.
9
Question
Consider the following DNN for image classification for a dataset that
consists of RGB images of size 32x32.
model = models.Sequential()
# Layer 1
model.add(layers.Dense(50, activation='relu', input_shape=**A**))
# Layer 2
model.add(layers.Dense(40, activation='relu'))
# Layer 3
model.add(layers.Dense(30, activation='relu'))
# Layer 4
model.add(layers.Dense(**B**, activation=**C**))
model.compile(optimizer='sgd', loss=**D**, metrics=['accuracy'])
10
Question
A. What is the input shape **A** in Layer 1?
B. What will be the value of **B**, activation function **C** and loss **D**
if the total number of classes in the dataset is (a) 2 (b) 10
C. What will be the total number of parameters in Layer 1, Layer 2 and
Layer 3?
D. If a dropout layer of value 0.5 is added after Layer 2, what will be the
change in the number of parameters?
11
Answer
A. **A** is (32*32*3,) or (3072,)
B. (**B**, **C**, **D**)
(a) 2 is 1, sigmoid, binary_crossentropy
(b) 10 is 10, softmax, categorical_crossentropy
C. What will be the total number of parameters in Layer 1, Layer 2 and
Layer 3?
● Layer 1- 3072*50 + 50 = 153,650
● Layer 2- 50*40 + 40 = 2040
● Layer 3- 40*30 + 30 = 1230
● Total = 153650 + 2040 + 1230 = 156,920
D. No change in the number of parameters if dropout is added.
12
Question
A perceptron structure and the training data are given
below.
Assume the following weights and bias. w1 = 0.41, w2
= 0.23, w3 = 0.5 and b = 0.01.
(a) Compute the output y of the perceptron for the first
training example.
(b) Compute the error.
(c) Update the weights and bias.
(d) Using the updated parameters, compute the output
for the second training example.
13
Solution
14
Question
15
Solution
16
Do it yourself :)
(a) Compute the forward propagation and generate the output. Use Sigmoid activation function.
(c) Let the given weights be at time (t-1). Compute the weights at time t using SGD.
(d) Compute the weights at (t + 1) using Momentum. Assume α = 1.1 and β = 0.8.
17
All the best :)
18