CNN Basic Structure, Hyper-Parameter Tuning, Regularization-Dropouts
Foundations of Data Science
(SEMINAR PRESENTATION)
CNN Basic Structure
Hyper-parameter tuning
Parameters to be tuned
• Activation Function
• Optimizer
• Learning Rate
• Batch size
• Epochs
• Number of Layers
Choice of Activation function
Softmax Activation
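As a small illustration (not from the original slides), here is a minimal NumPy sketch of the softmax activation: it exponentiates the raw scores and normalizes them so the outputs sum to 1, which is why it is typically used in the final layer of a multi-class classifier.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Example: raw class scores from the last layer
print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1
```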
Optimization Considerations
• Increase accuracy
• Decrease the loss function and the cost function
• The loss function calculates the error for a single observation
• The cost function calculates the error averaged over the whole dataset.
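A hedged NumPy sketch of this distinction (illustrative values, squared error chosen as an example): the loss is computed per observation, and the cost is its mean over the dataset.

```python
import numpy as np

def loss(y_true, y_pred):
    # Squared-error loss for a single observation
    return (y_true - y_pred) ** 2

def cost(y_true, y_pred):
    # Cost: average of the per-observation losses over the whole dataset
    return np.mean(loss(y_true, y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8, 0.4])
print(loss(y_true, y_pred))   # one error value per observation
print(cost(y_true, y_pred))   # a single scalar over the dataset
```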
Optimization: Gradient Descent
• The gradient of a function at any point is the direction of steepest increase (ascent) of the function at that point; to minimize the cost, gradient descent therefore moves the parameters in the opposite direction of the gradient.
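A minimal gradient-descent sketch on a made-up quadratic cost (purely illustrative): each step moves the parameter against the gradient, scaled by the learning rate.

```python
# Minimize a toy cost J(w) = (w - 3)^2 with plain gradient descent
def grad(w):
    return 2.0 * (w - 3.0)   # dJ/dw

w, learning_rate = 0.0, 0.1
for step in range(50):
    w -= learning_rate * grad(w)   # move against the gradient
print(w)   # converges towards the minimum at w = 3
```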
Optimizer
• Traditional Gradient Descent
Adam Optimizer
• Adaptive Moment Estimation (Adam) is an algorithm for optimizing gradient descent.
• Intuitively, it is a combination of the ‘gradient descent with momentum’ algorithm and the ‘RMSP’ (Root Mean Square Propagation) algorithm.
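A hedged NumPy sketch of a single Adam update (an illustration of the idea, not code from the slides): the first moment plays the role of momentum, the second moment plays the role of RMSProp's running average of squared gradients, and both are bias-corrected before the step.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient g at step t."""
    m = beta1 * m + (1 - beta1) * g          # momentum-like first moment
    v = beta2 * v + (1 - beta2) * g ** 2     # RMSProp-like second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([0.5, -0.3])
m = np.zeros_like(w)
v = np.zeros_like(w)
g = np.array([0.1, -0.2])                     # example gradient
w, m, v = adam_step(w, g, m, v, t=1)
```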
Backpropagation
Learning Rate
Batch size, Number of epochs
• The batch size defines the number of samples that will be propagated
through the network.
• one epoch = one forward pass and one backward pass of all the training
examples
• The higher the batch size, the more memory space you'll need.
• number of iterations = number of passes, each pass using [batch size]
number of examples. To be clear, one pass = one forward pass + one
backward pass (we do not count the forward pass and backward pass as
two different passes).
• Example: if you have 1000 training examples, and your batch size is 500,
then it will take 2 iterations to complete 1 epoch.
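The relation between dataset size, batch size, and iterations per epoch can be checked with a tiny sketch (illustrative only):

```python
import math

num_examples = 1000
batch_size = 500
iterations_per_epoch = math.ceil(num_examples / batch_size)
print(iterations_per_epoch)   # 2 iterations to complete 1 epoch
```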
Layers
• Regularization
• Dropout layers
• Max Pooling
• Flattening
Regularization
• Regularization is a way of adding some constraints or penalties to the
model, so that it does not overfit the training data.
• There are different types of regularization methods, but they all aim to
reduce the variance of the model and increase its bias.
• Variance measures how sensitive the model is to small changes in the data,
while bias measures how far the model is from the true relationship. A
good model should have low variance and low bias, but there is usually a
trade-off between them.
• Regularization helps find a balance between them by shrinking or pruning
the model parameters, adding noise or dropout to the layers, or
augmenting the data with transformations.
Regularization parameters
• https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/l2-regularization
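As a hedged illustration of the L2 (weight-decay) penalty discussed at the link above, here is a small NumPy sketch of a regularized cost; the lambda value is an arbitrary example, not a recommendation from the slides.

```python
import numpy as np

def l2_regularized_cost(data_cost, weights, lam=0.01):
    # Cost = data term + lambda * sum of squared weights (L2 penalty)
    # lam (lambda) is the regularization strength; 0.01 is just an example value
    return data_cost + lam * np.sum(weights ** 2)

weights = np.array([0.5, -1.2, 0.8])
print(l2_regularized_cost(data_cost=0.30, weights=weights))
```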
Bias and Variance
Bias and Variance tradeoff
• High Variance:
• Try using a smaller set of features
• Try increasing lambda (stronger regularization)
• Get more training examples, e.g. via data augmentation
• High Bias:
• Try adding additional features
• Try adding polynomial features
• Try decreasing lambda
Dropout, Max pooling layers
• Dropout is a regularization technique used to reduce over-fitting in neural networks.
• Usually, deep learning models apply dropout to the fully connected layers, but it is also possible to use dropout after the max-pooling layers, which acts as a form of image-noise augmentation.
• The max-pooling layers down-sample the data, and dropout forces the neural network to learn in a more robust way.
• The fully connected layer takes its input from the Flatten layer, which is a one-dimensional (1D) layer (see the sketch below).
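A hedged Keras-style sketch of such a stack (an illustrative toy model, not an architecture from the slides), showing max pooling, dropout after the pooling layer, flattening, and the fully connected output:

```python
from tensorflow.keras import layers, models

# Toy CNN: conv -> max pooling -> dropout -> flatten -> fully connected
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),      # down-samples the feature maps
    layers.Dropout(0.25),             # dropout after max pooling (noise augmentation)
    layers.Flatten(),                 # 1D layer feeding the fully connected part
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),              # dropout on the fully connected layer
    layers.Dense(10, activation="softmax"),
])
model.summary()
```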
Probable Questions and Numericals
Theoretical question
Describe the characteristics of the activation function. How does its behavior influence the learning
process and the expressiveness of the neural network?
• Explain the activation function given in the problem.
• Plot the graph of the activation function and analyze which kind of dataset would be suited to such behavior.
• Once the type of problem is identified, explain the relation between that particular activation function and the solution of the problem.
• Explain the relevance of the activation function in solving the problem.
Refer: https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/
Theoretical question
What is the role of an optimizer in training a Convolutional Neural Network (CNN)? How does the
choice of optimizer impact the training process?
• Explain what an optimizer is
• Need for optimizing the CNN model
• How optimizers make the model better
• Which optimizer we choose for the problem
Refer: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/
Theoretical question
Compare two activation functions. How does the usage of these activation functions impact the CNN
design?
Perform a comparison of the two activation functions using the details below and explain in detail:
• Explain the activation functions given in the problem.
• Plot the graph of each activation function and analyze which kind of dataset would be suited to such behavior.
• Once the type of problem is identified, explain the relation between each activation function and the solution of the problem.
• Explain the relevance of the activation functions in solving the problem.
Refer: https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/
Theoretical Questions
Compare and contrast the characteristics of gradient descent (GD) and Adam optimizers. How do
they differ in terms of adaptive learning rates and momentum?
• Explain what gradient descent (GD) and the Adam optimizer are
• Compare the two optimizers
Refer: https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/
• Momentum is a gradient descent optimization approach that adds a fraction of the previous update vector to the current update vector to speed up the learning process. In basic terms, momentum is a method of smoothing out model parameter updates, allowing the optimizer to keep advancing in the same direction as before, minimizing oscillations and increasing convergence speed.
Theoretical Question
Explain the concept of learning rate decay in optimizers for CNNs. Why might using a high initial
learning rate be problematic, and how does learning rate decay help mitigate this issue?
• Learning rate decay is a technique used in training Convolutional Neural Networks (CNNs) and other machine learning models to adaptively adjust the
learning rate during the training process. The learning rate is a hyperparameter that determines the step size or rate at which the model's parameters
(weights and biases) are updated during optimization, typically using gradient-based optimization algorithms. Learning rate decay involves systematically
reducing the learning rate over time as training progresses.
Here's why it's important and how it works:
Problem with a High Initial Learning Rate:
Using a high initial learning rate can lead to several issues during training:
• Overshooting
• Instability
• Divergence
How Learning Rate Decay Helps:
Learning rate decay is used to mitigate the issues associated with a high initial learning rate.
It works by gradually reducing the learning rate as training progresses. Here's how it helps:
• Stability
• Convergence
• Generalization: Lower learning rates towards the end of training can improve the model's generalization to unseen data. This is because a decreasing
learning rate encourages the model to learn more robust and general features rather than memorizing the training data (overfitting).
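A hedged sketch of a simple step-decay schedule (the constants are illustrative, not from the slides): the learning rate is cut by a fixed factor every few epochs.

```python
def step_decay(initial_lr, epoch, drop_factor=0.5, epochs_per_drop=10):
    # Halve the learning rate every `epochs_per_drop` epochs
    return initial_lr * (drop_factor ** (epoch // epochs_per_drop))

for epoch in (0, 10, 20, 30, 40):
    print(epoch, step_decay(0.01, epoch))
```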
Numerical Problem:
Topic: Learning Rate
Q. You are training a CNN model for image classification using the SGD optimizer
with momentum. The initial learning rate is set to 0.1, and the momentum
coefficient is 0.9. After training for 100 epochs, you notice that the loss function has
converged, but the model's accuracy on the validation set is not improving. One
potential solution is to adjust the learning rate. If you decide to reduce the learning
rate by a factor of 0.5 every 20 epochs, what will the learning rate be at the start of
the 101st epoch?
Ans:
The learning rate is halved once every 20 epochs. Over 100 epochs of training, it is therefore reduced 5 times (after epochs 20, 40, 60, 80, and 100):
0.1 × 0.5^5 = 0.1 × 0.03125 = 0.003125
So the learning rate at the start of the 101st epoch is 0.003125.
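A small sketch to verify the schedule (illustrative only):

```python
lr = 0.1
for epoch in range(1, 101):
    if epoch % 20 == 0:          # reduce by a factor of 0.5 every 20 epochs
        lr *= 0.5
print(lr)                        # 0.003125 at the start of the 101st epoch
```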
Topic: CNN Trainable Parameters
Given:
● The network has three layers.
● Number of neurons in each layer:
● Layer 1: 100 neurons
● Layer 2: 50 neurons
● Layer 3: 10 neurons
Steps:
1. Calculate the number of weights:
For the first layer:
● As this is the input layer, we assume that the number of input features matches the number of neurons. Thus, there are no weights leading into this layer.
For the second layer:
● Weights from layer 1 to layer 2: 100×50 = 5000 weights.
For the third layer:
● Weights from layer 2 to layer 3: 50×10 = 500 weights.
2. Calculate the number of biases:
For every neuron in a layer, there is one bias term.
● Layer 1: 100 biases (since there are 100 neurons).
● Layer 2: 50 biases (since there are 50 neurons).
● Layer 3: 10 biases (since there are 10 neurons).
Total weights = 5000 (from layer 1 to 2) + 500 (from layer 2 to 3) = 5500
Total biases = 100 (for layer 1) + 50 (for layer 2) + 10 (for layer 3) = 160
Total learnable parameters = Total weights + Total biases = 5500 + 160 = 5660
Conclusion: The network has 5660 learnable parameters in total (5500 weights + 160 biases).
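A hedged sketch that reproduces the count above, following the slide's convention of one bias per neuron in every layer, including the first (in most frameworks the input layer itself would carry no parameters):

```python
def count_parameters(layer_sizes):
    # Weights between consecutive layers + one bias per neuron in every layer
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes)
    return weights, biases, weights + biases

print(count_parameters([100, 50, 10]))   # (5500, 160, 5660)
```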
Topic: Adam optimizer – Learning Rate
Q. You are training a CNN using the Adam optimizer for image classification. The
initial learning rate is set to 0.01, and the exponential decay rates for the first and
second moments are 0.9 and 0.999, respectively. You start with a batch of 64
training examples. Calculate the effective learning rate for the first iteration after the
Adam optimizer's parameter updates.
Ans:
Conclusion:
The effective learning rate for the first iteration after the Adam optimizer's
parameter updates is approximately 0.1
Note that without knowing the exact gradient values, this is a rough approximation
based on the initialization and bias-correction terms.
Topic: Numerical Problem on Gradient
Q.You are training a CNN using backpropagation for image segmentation. The loss
function you are using is the mean squared error (MSE). After a forward pass, you
calculate the following activations and target values for a specific pixel in the output
layer:
Calculate the gradient of the MSE loss with respect to the activation of this pixel.
Ans:
Conclusion:
The gradient of the MSE loss with respect to the activation of this specific pixel is 0.2.
This gradient indicates the direction and magnitude by which we should adjust the
network's weights (via backpropagation) to reduce the error for this specific pixel.
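Since the slide's activation and target values are not reproduced here, the sketch below uses hypothetical values (activation a = 0.8, target y = 0.7, both assumptions) chosen so the gradient matches the stated 0.2 under the convention L = (a − y)², dL/da = 2(a − y):

```python
# Hypothetical values for one output pixel (not from the original slide)
activation = 0.8   # network output a
target = 0.7       # ground-truth value y

# MSE for a single pixel: L = (a - y)^2, so dL/da = 2 * (a - y)
gradient = 2.0 * (activation - target)
print(gradient)    # 0.2 (up to floating-point rounding)
```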
Topic: Weights and Biases, Parameter updates
Q.You are training a CNN for image segmentation using backpropagation and the
Adam optimizer. The network architecture consists of four convolutional layers and
two fully connected layers. The batch size is 32, and you train for 60 epochs.
Calculate the total number of parameter updates (weight and bias updates)
performed during training.
Ans:
Assumed architecture:
● Conv1: 16 filters of size 3x3 (input channels depend on the input image's depth, e.g.,
3 for RGB)
● Conv2: 32 filters of size 3x3
● Conv3: 64 filters of size 3x3
● Conv4: 128 filters of size 3x3
● FC1: 512 neurons
● FC2: 100 neurons
Remember, convolutional weights are determined by the filter size and the number of
input/output channels. Biases are determined by the number of output channels (or filters)
for each layer.
Parameters for Conv1:
● Weights: 16 filters * 3 * 3 * 3 (assuming RGB input) = 432
● Biases: 16
Number of batches = Total number of images / Batch size = 1000 / 32 ≈ 31.25 (rounded up to 32), assuming a training set of 1,000 images.
So, throughout your training, you'll perform approximately 518,617,600 weight and bias
updates.
Note: This is a hypothetical calculation based on assumed architecture. The exact number
will depend on the architecture details of your CNN.
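A hedged helper for the per-layer counts used above (illustrative only; it reproduces the Conv1 figures of 432 weights and 16 biases for a 3-channel RGB input):

```python
def conv_params(num_filters, kernel_size, in_channels):
    # Each filter has kernel_size * kernel_size * in_channels weights, plus one bias
    weights = num_filters * kernel_size * kernel_size * in_channels
    biases = num_filters
    return weights, biases

print(conv_params(16, 3, 3))   # Conv1: (432, 16)
```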
Topic: Weights and Biases, Parameter updates
Q. You are training a CNN for image segmentation using backpropagation and the
Adam optimizer. The network architecture consists of four convolutional layers and
two fully connected layers. The batch size is 32, and you train for 60 epochs.
Calculate the total number of parameter updates (weight and bias updates)
performed during training.
Ans:
Given Network Architecture:
● 4 convolutional layers
● 2 fully connected layers
● Conv1 biases: 64
● Conv2 biases: 128
● Conv3 biases: 256
● Conv4 biases: 512
● FC1 biases: 1024
● FC2 biases: 512
3. Calculate Weights for Each Layer:
Weights connect neurons from one layer to the next. In fully connected layers, every neuron in the
previous layer connects to every neuron in the next layer. For convolutional layers, this is a bit
trickier; the weights for a convolutional layer are determined by the filter size and the depth of the
previous layer. However, in this explanation, we're simply assuming a connection count based on
neuron count.
● Conv1 weights: The first layer typically takes input from an image, so the weights are usually
determined by the filter size, the depth of the input (like RGB channels), and the number of filters.
This isn't given here, so we have a simplified calculation:
(3 * 64) + 64 (biases) = 256
● Conv2 weights: (64 filters from Conv1 * 128 filters in Conv2) + 128 biases = 8320
● Conv3 weights: (128 * 256) + 256 biases = 33024
● Conv4 weights: (256 * 512) + 512 biases = 131584
● FC1 weights: (Assuming the output of Conv4 is flattened to 512 features)
(512 * 1024) + 1024 biases = 525312
● FC2 weights: (1024 * 512) + 512 biases = 524800
4. Summing Up:
Total parameters (weights and biases together) across all layers = 256 + 8,320 + 33,024 + 131,584 + 525,312 + 524,800 = 1,223,296
Of these, the biases account for 64 + 128 + 256 + 512 + 1,024 + 512 = 2,496, leaving 1,220,800 weights.
Number of batches = Total images / Batch size = 1,000 / 32 = 31.25. But since we can't have a
fraction of a batch, you'd typically round up or handle the last batch as a smaller one. For simplicity,
let's assume 32 batches (this takes the training set to be 1,000 images).
Total number of parameter updates = Total parameters * Number of epochs * Number of batches
= 1,223,296 * 60 * 32 = 2,348,728,320 updates
Thus, throughout the training, approximately 2,348,728,320 weight and bias updates will be
performed.
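A hedged sketch that follows the slide's simplified convention (layer sizes [3, 64, 128, 256, 512, 1024, 512], weights = previous units × current units, one bias per unit) and reproduces the totals above; the 1,000-image training set is the same assumption as in the worked answer.

```python
import math

layer_sizes = [3, 64, 128, 256, 512, 1024, 512]    # simplified convention from the slide
weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
biases = sum(layer_sizes[1:])                      # one bias per unit in each non-input layer
total_params = weights + biases                    # 1,223,296

epochs = 60
batches = math.ceil(1000 / 32)                     # 32 batches, assuming 1,000 training images
print(total_params, total_params * epochs * batches)   # 1223296 2348728320
```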
Thank you