INTERVIEW PREPARATION
(30 Days of Interview Preparation)
# Day-5
Since the data for one epoch is too large to feed to the computer at once, we divide it into several smaller batches.
We always use more than one epoch because training for only one epoch leads to underfitting.
As the number of epochs increases, the weights in the neural network are updated more times, and the curve goes from underfitting, to optimal, to overfitting.
Unlike the learning rate hyperparameter, whose value does not affect computational time, the batch size must be examined in conjunction with the execution time of training. The batch size is limited by the hardware's memory, while the learning rate is not. Leslie recommends using a batch size that fits in the hardware's memory and enables using larger learning rates.
If our server has multiple GPUs, the total batch size is the batch size on one GPU multiplied by the number of GPUs. If the architecture is small or your hardware permits very large batch sizes, then you might compare the performance of different batch sizes. Also, recall that small batch sizes add regularization while large batch sizes add less, so use this while balancing the proper amount of regularization. It is often better to use a larger batch size so that a larger learning rate can be used.
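As a quick illustration of the arithmetic above, here is a small sketch; the dataset size, per-GPU batch size, and GPU count are made-up numbers, not taken from the text:

```python
# Illustrative numbers only.
dataset_size = 50_000        # training examples
per_gpu_batch_size = 128     # limited by a single GPU's memory
num_gpus = 4

# Total (effective) batch size when training data-parallel across GPUs.
total_batch_size = per_gpu_batch_size * num_gpus      # 512

# Weight updates needed to see the whole dataset once, i.e. one epoch.
iterations_per_epoch = dataset_size // total_batch_size  # 97 (last partial batch dropped)

print(total_batch_size, iterations_per_epoch)
```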
More technically, at each training stage, individual nodes are either dropped out of the net with
probability 1-p or kept with probability p, so that a reduced network is left; incoming and outgoing edges
to a dropped-out node are also removed.
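A minimal NumPy sketch of this idea using the common "inverted" dropout formulation, with a keep probability `p` as described above; the function and variable names are illustrative:

```python
import numpy as np

def dropout(activations, p=0.8, training=True):
    """Inverted dropout: keep each unit with probability p and rescale by 1/p
    so the expected activation stays the same at inference time."""
    if not training:
        return activations                                # no dropout at inference
    mask = (np.random.rand(*activations.shape) < p).astype(activations.dtype)
    return activations * mask / p                         # dropped units become 0

# Example: a batch of 4 examples with 5 hidden units each.
h = np.random.randn(4, 5)
print(dropout(h, p=0.8))
```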
Where to use
It can be used with most types of layers, such as dense fully connected layers, convolutional layers, and
recurrent layers such as the long short-term memory network layer.
Dropout may be implemented on any or all hidden layers in the network as well as the visible or input
layer. It is not used on the output layer.
Benefits:-
1. Dropout forces a neural network to learn more robust features that are very useful in conjunction
with different random subsets of the other neurons.
2. Dropout generally doubles the number of iterations required to converge. However, the training
time for each epoch is less.
• Grid Search
• Random Search
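A rough sketch of the difference between the two search strategies; the hyperparameter ranges and the `train_and_evaluate` function are placeholders, not from the text:

```python
import itertools
import random

# Hypothetical search space.
learning_rates = [1e-1, 1e-2, 1e-3]
batch_sizes = [32, 64, 128]

def train_and_evaluate(lr, batch_size):
    """Placeholder: train a model briefly and return a validation loss."""
    return (lr - 0.01) ** 2 + (batch_size - 64) ** 2 * 1e-5  # dummy score

# Grid search: try every combination in the grid.
grid_results = {(lr, bs): train_and_evaluate(lr, bs)
                for lr, bs in itertools.product(learning_rates, batch_sizes)}

# Random search: sample a fixed budget of random combinations.
random_results = {}
for _ in range(5):
    lr, bs = random.choice(learning_rates), random.choice(batch_sizes)
    random_results[(lr, bs)] = train_and_evaluate(lr, bs)

print(min(grid_results, key=grid_results.get))
print(min(random_results, key=random_results.get))
```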
1. Observe and understand the clues available during training by monitoring the validation/test loss early in training; tune your architecture and hyper-parameters with short runs of a few epochs.
2. Signs of underfitting or overfitting of the test or validation loss early in the training process are
useful for tuning the hyper-parameters.
• SageMaker
• Comet.ml
• Weights & Biases
• Deep Cognition
• Azure ML
In most learning networks, the error is calculated as the difference between the predicted output and the actual output.
The function used to compute this error is known as the loss function, J(.). Different loss functions give different errors for the same prediction and thus have a considerable effect on the performance of the model. One of the most widely used loss functions is the mean squared error, which calculates the square of the difference between the actual value and the predicted value. Different loss functions are used to deal with different types of tasks, i.e. regression and classification.
Regression losses:
1. Mean Squared Error
2. Absolute Error
Classification losses:
1. Binary Cross-Entropy
2. Negative Log-Likelihood
3. Margin Classifier
4. Soft Margin Classifier
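As an illustration, here is a small NumPy sketch of two of the losses mentioned above, mean squared error (regression) and binary cross-entropy (classification); the example values are made up:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between actual and predicted values.
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true in {0, 1}; y_pred is a predicted probability in (0, 1).
    y_pred = np.clip(y_pred, eps, 1 - eps)     # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7, 0.4])
print(mean_squared_error(y_true, y_pred))
print(binary_cross_entropy(y_true, y_pred))
```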
Activation functions decide whether a neuron should be activated or not by calculating a weighted sum and adding a bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
In a neural network, we update the weights and biases of the neurons based on the error at the output. This process is known as back-propagation. Activation functions make back-propagation possible, since the gradients are supplied along with the error to update the weights and biases.
A neural network without activation functions is essentially a linear regression model. The activation
functions do the non-linear transformation to the input, making it capable of learning and performing
more complex tasks.
1. Identity
2. Binary Step
3. Sigmoid
4. Tanh
5. ReLU
6. Leaky ReLU
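A small NumPy sketch of a few of the activation functions listed above; the Leaky ReLU slope of 0.01 is a common default chosen here purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # squashes input into (0, 1)

def tanh(x):
    return np.tanh(x)                          # squashes input into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                  # zero for negative inputs

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)       # small slope for negative inputs

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (sigmoid, tanh, relu, leaky_relu):
    print(fn.__name__, fn(x))
```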
Q7: What do you understand by the vanishing gradient problem, and how do we solve it?
The problem:
As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network harder to train.
Why:
Certain activation functions, like the sigmoid function, squash a large input space into a small output space between 0 and 1. Therefore, a large change in the input of the sigmoid function causes only a small change in the output. Hence, the derivative becomes small.
For shallow networks with only a few layers that use these activations, this isn't a big problem. However, when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied together, so the gradient decreases exponentially as we propagate back to the initial layers and can become too small for training to work effectively.
The simplest solution is to use other activation functions, such as ReLU, which doesn’t cause a small
derivative.
Residual networks are another solution, as they provide residual connections straight to earlier layers.
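A short sketch of why the product of sigmoid derivatives shrinks the gradient; the depths used below are arbitrary, chosen only to show the trend:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)                 # maximum value is 0.25, reached at x = 0

# Even in the best case (derivative = 0.25 at every layer), the gradient
# reaching the first layer shrinks exponentially with depth.
for n_layers in (2, 5, 10, 20):
    print(n_layers, 0.25 ** n_layers)

# ReLU's derivative is 1 for positive inputs, so this product does not shrink.
```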
It is a popular approach in deep learning where pre-trained models are used as the starting point for computer vision and natural language processing tasks, given the vast compute and time resources required to develop neural network models for these problems.
Transfer learning is a machine learning technique where a model trained on one task is re-purposed on a
second related task.
Transfer learning is an optimization that allows rapid progress or improved performance when modelling
the second task.
Transfer learning only works in deep learning if the model features learned from the first task are general.
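A minimal transfer-learning sketch, assuming TensorFlow/Keras is available and using an ImageNet-pretrained VGG16 as a frozen feature extractor; the new classifier head and the class count of 10 are illustrative choices, not from the text:

```python
import tensorflow as tf

# Pretrained convolutional base (trained on ImageNet), without its classifier head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                      # freeze the general features learned on task 1

# New classifier head for the second, related task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_task_dataset, epochs=5)     # train only the new head
```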
This architecture is from the VGG group at Oxford. It improves on AlexNet by replacing the large kernel-sized filters with multiple 3x3 kernel-sized filters stacked one after another. For a given receptive field (the effective area of the input image on which the output depends), multiple stacked smaller-size kernels are better than a single larger-size kernel, because multiple non-linear layers increase the depth of the network, which enables it to learn more complex features, and at a lower cost.
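A quick parameter-count check of the "lower cost" claim: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution but use fewer weights. The channel count below is arbitrary:

```python
C = 64  # input and output channels, chosen arbitrarily for illustration

# One 5x5 convolution vs. two stacked 3x3 convolutions (bias terms ignored).
params_5x5 = 5 * 5 * C * C             # 102,400
params_two_3x3 = 2 * (3 * 3 * C * C)   # 73,728

print(params_5x5, params_two_3x3)      # the stacked 3x3 option uses ~28% fewer parameters
```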
Three fully connected layers follow the VGG convolutional layers. The width of the network starts at a small value of 64 and increases by a factor of 2 after every sub-sampling/pooling layer. It achieves a top-5 accuracy of 92.3% on ImageNet.
The power of the residual network can be judged from the experiments in the paper: the plain 34-layer network had a higher validation error than the plain 18-layer network. This is where we see the degradation problem. The same 34-layer network, when converted into a residual network, has a much lower training error than the 18-layer residual network.
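A minimal sketch of a residual (skip) connection of the kind described above, again assuming TensorFlow/Keras; the filter counts and input shape are illustrative:

```python
import tensorflow as tf

def residual_block(x, filters=64):
    """Two 3x3 convolutions whose output is added back to the block's input."""
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([y, shortcut])        # the skip connection
    return tf.keras.layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()
```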
When we hear about "ImageNet" in the context of deep learning and convolutional neural networks, we are referring to the ImageNet Large Scale Visual Recognition Challenge.
The main aim of this image classification challenge is to train a model that can correctly classify an input image into one of 1,000 separate object categories.
These 1,000 image categories represent object classes that we encounter in our day-to-day lives, such as species of dogs and cats, various household objects, vehicle types, and much more.
When it comes to image classification, the ImageNet challenge is the "de facto" benchmark for computer vision classification algorithms, and the leaderboard for this challenge has been dominated by convolutional neural networks and deep learning techniques since 2012.
Clone the repo locally, and you have it. To compile it, run make. But first, if you intend to use the GPU
capability, you need to edit the Makefile in the first two lines, where you tell it to compile for GPU usage
with CUDA drivers.
Q13: What is YOLO, and explain the architecture of YOLO (You Only Look Once).
The first You Only Look Once (YOLO) version came out around May 2016 and sets the core of the algorithm; the following versions are improvements that fix some of its drawbacks.
Core Concept:-
The algorithm works by dividing the image into a grid of cells; for each cell, bounding boxes and their scores are predicted, alongside class probabilities. The confidence is given in terms of the IOU (intersection over union) metric, which measures how much the detected object overlaps with the ground truth as a fraction of the total area spanned by the two together (the union).
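A small sketch of the IOU computation for two axis-aligned boxes given as (x1, y1, x2, y2) corners; the example boxes are made up:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```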
YOLO v2-
This improves on some of the shortcomings of the first version, namely that it is not very good at detecting objects that are very near each other and that it tends to make localization mistakes.
It introduces a few new things: anchor boxes (pre-determined sets of boxes, so that the network moves from predicting the bounding boxes directly to predicting offsets from these) and the use of more fine-grained features so that smaller objects can be predicted better. A sketch of the offset decoding follows below.
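A sketch of YOLOv2-style decoding of predicted offsets (tx, ty, tw, th) into a box, relative to a grid cell at (cx, cy) and an anchor box of size (pw, ph); the numeric values below are illustrative only:

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn predicted offsets into a box center and size (YOLOv2-style)."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)      # center x, constrained to stay inside the grid cell
    by = cy + sigmoid(ty)      # center y
    bw = pw * math.exp(tw)     # width as a scaling of the anchor width
    bh = ph * math.exp(th)     # height as a scaling of the anchor height
    return bx, by, bw, bh

# Illustrative offsets for the cell at (3, 4) and an anchor of size 2.0 x 3.0.
print(decode_box(0.2, -0.1, 0.3, 0.1, cx=3, cy=4, pw=2.0, ph=3.0))
```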
YOLO v3-
YOLOv3 came out around April 2018, and it adds small improvements, including the fact that bounding boxes get predicted at different scales. The backbone of the YOLO network, Darknet, is expanded in this version to 53 convolutional layers.