QB Unit 3
The general idea is to tweak parameters iteratively in order to minimize the cost function.
An important parameter of Gradient Descent (GD) is the size of the steps, determined by the learning
rate hyperparameter. If the learning rate is too small, the algorithm needs many iterations to
converge, which takes a long time; if it is too high, the updates may jump over the optimal value.
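The effect of the learning rate can be seen on a toy problem. The sketch below is only an illustration: the cost f(x) = x**2, the starting point, and the three learning rates are assumptions chosen for the example, not values from the notes.

```python
def gradient_descent(learning_rate, steps=50, start=5.0):
    """Minimize the toy cost f(x) = x**2 (gradient 2*x) and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x   # step against the gradient
    return x

too_small = gradient_descent(0.001)   # barely moves away from the start
reasonable = gradient_descent(0.1)    # ends close to the minimum at x = 0
too_large = gradient_descent(1.1)     # every step overshoots further than the last
```

With the small rate the result is still far from 0 after 50 steps, and with the large rate the iterates move away from the minimum instead of toward it.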
In Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set for
each iteration.
In Gradient Descent, there is a term called “batch” which denotes the total number of samples from a
dataset that is used for calculating the gradient for each iteration.
In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the
whole dataset.
Advantages
Memory Efficiency: Because SGD processes only a few samples at a time, it is memory-efficient and can
handle large datasets that cannot fit into memory.
Avoidance of Local Minima: The noisy updates in SGD can help it escape from local minima, although
convergence to a global minimum is not guaranteed.
Disadvantages
Noisy updates: The updates in SGD are noisy and have a high variance, which can make
the optimization process less stable and lead to oscillations around the minimum.
Slow Convergence: SGD may require more iterations to converge to the minimum since
each update is based on a noisy estimate of the gradient.
Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD since
using a high learning rate can cause the algorithm to overshoot the minimum, while a low
learning rate makes convergence slow.
Less Accurate: Due to the noisy updates, SGD may not converge to the exact global
minimum and can result in a suboptimal solution. This can be mitigated by using
techniques such as a decaying learning rate or mini-batches.
A Machine Learning model is defined as a mathematical model with a number of parameters that need
to be learned from the data. By training a model with existing data, we are able to fit the model
parameters.
However, there is another kind of parameter, known as Hyperparameters, that cannot be directly
learned from the regular training process. They are usually fixed before the actual training process
begins. These parameters express important properties of the model such as its complexity or how fast
it should learn.
GridSearchCV
RandomizedSearchCV
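scikit-learn's GridSearchCV and RandomizedSearchCV automate two search strategies and combine them with cross-validation. The sketch below does not use scikit-learn; it is a hand-rolled illustration of the two strategies on a toy problem. The function final_cost, the grid values, and the sampling ranges are all hypothetical.

```python
import itertools
import random

def final_cost(learning_rate, steps):
    """Run gradient descent on the toy cost (x - 3)**2 and return the final cost."""
    x = 0.0
    for _ in range(steps):
        x -= learning_rate * 2 * (x - 3)
    return (x - 3) ** 2

# Grid search: evaluate every combination in a fixed grid.
grid = {"learning_rate": [0.001, 0.01, 0.1], "steps": [10, 100]}
grid_candidates = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
best_grid = min(grid_candidates, key=lambda params: final_cost(**params))

# Random search: evaluate a fixed number of randomly sampled combinations.
rng = random.Random(0)
random_candidates = [
    {"learning_rate": 10 ** rng.uniform(-3, -1), "steps": rng.choice([10, 100])}
    for _ in range(5)
]
best_random = min(random_candidates, key=lambda params: final_cost(**params))
```

Grid search costs one evaluation per grid cell, so it grows quickly with the number of hyperparameters; random search keeps the budget fixed (here 5 trials) regardless of how many hyperparameters are sampled.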
Explain the principle of the gradient descent algorithm. Accompany your explanation with a diagram.
Solution: Training can be posed as an optimization problem, in which the goal is to optimize a function
(usually to minimize a cost function E) with respect to a number of free variables, usually the weights w_i.
The gradient descent algorithm begins from an initialization of the weights (e.g. a random initialization)
and, in an iterative procedure, updates each weight w_i by a quantity Δw_i, where Δw_i = –α (∂E / ∂w_i),
(∂E / ∂w_i) is the gradient of the cost function with respect to the weight, and α is a constant (the
learning rate) that takes small values in order to keep the updates small and avoid oscillations.
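The update rule above can be sketched in a few lines. The cost function, its minimum at (1, -2), the value of alpha, and the fixed starting weights are assumptions made for this example only.

```python
def cost(w):
    """Hypothetical cost E(w) = (w0 - 1)**2 + (w1 + 2)**2, minimized at (1, -2)."""
    return (w[0] - 1) ** 2 + (w[1] + 2) ** 2

def gradient(w):
    """Partial derivatives dE/dw_i of the cost above."""
    return [2 * (w[0] - 1), 2 * (w[1] + 2)]

alpha = 0.1        # small constant step size
w = [0.0, 0.0]     # initial weights (fixed here rather than random, for repeatability)
for _ in range(100):
    grad = gradient(w)
    # delta_w_i = -alpha * dE/dw_i, applied to every weight
    w = [w_i - alpha * g_i for w_i, g_i in zip(w, grad)]
```

After the loop, w is very close to the minimizer (1, -2) and cost(w) is close to zero.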
16 Marks
Discuss in detail how the network is trained.
Every neuron in a given layer generates an output, but these outputs do not carry the same weight into
the next layer of neurons. This means that if a neuron in a layer observes a pattern that matters less for
the overall picture, its output will be partially or completely muted. This is called weighting.
A large weight means that the input is important, and a small weight means that it should largely be
ignored. Every connection between neurons has an associated weight.
The weights are adjusted during training to fit the objectives we have set (recognizing that a dog is a dog
and that a cat is a cat).
In simple terms: training a neural network means finding the appropriate weights of the neural
connections through a feedback loop called backpropagation of the gradient.
4. Calculate the forward pass (what would be the output with the current weights)
6. Adjust the weights (incrementing or decrementing them by a step scaled by the learning rate) according to
the backpropagated gradient
7. Go back to step 2
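The loop described by these steps can be sketched for a single linear neuron. This is a minimal illustration, not a full network: the dataset, the target relation y = 2*x + 1, the learning rate, and the number of epochs are all assumptions.

```python
# Toy dataset for the hypothetical target y = 2*x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

weight, bias = 0.0, 0.0    # initial parameters
learning_rate = 0.05

for epoch in range(500):
    for x, target in data:
        output = weight * x + bias          # forward pass with the current weights
        error = output - target             # compare with the expected output
        grad_w = 2 * error * x              # backward pass: d(error**2)/d(weight)
        grad_b = 2 * error                  # backward pass: d(error**2)/d(bias)
        weight -= learning_rate * grad_w    # adjust the parameters
        bias -= learning_rate * grad_b
```

On this noise-free data the parameters approach weight = 2 and bias = 1. In a multi-layer network the two gradient lines are replaced by backpropagation through every layer.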
Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide
range of problems.
The general idea is to tweak parameters iteratively in order to minimize the cost function.
An important parameter of Gradient Descent (GD) is the size of the steps, determined by the learning
rate hyperparameter. If the learning rate is too small, the algorithm needs many iterations to
converge, which takes a long time; if it is too high, the updates may jump over the optimal value.
Batch Gradient Descent computes the gradient over the full training set at each step, so it is very slow
and computationally expensive on very large training sets.
In SGD, only one training example is used to compute the gradient and update the parameters at each
iteration. This can be faster than batch gradient descent but may lead to more noise in the updates.
In mini-batch gradient descent, a small batch of training examples is used to compute the gradient and
update the parameters at each iteration. This can be a good compromise between batch gradient
descent and SGD, as it can be faster than batch gradient descent and less noisy than SGD.
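The three variants differ only in how many examples feed each gradient estimate. The sketch below makes that explicit with a single batch_size parameter; the one-weight model y = w * x, the data, and the hyperparameter values are assumptions for illustration.

```python
import random

# Toy, noise-free data for the hypothetical model y = w * x with true w = 3.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
data = [(x, 3.0 * x) for x in xs]

def gradient(w, samples):
    """Mean gradient of the squared error (w*x - y)**2 over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)

def train(batch_size, steps=300, learning_rate=0.05, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)   # random subset, without replacement
        w -= learning_rate * gradient(w, batch)
    return w

w_batch = train(batch_size=len(data))   # batch GD: the whole dataset at every step
w_sgd = train(batch_size=1)             # SGD: one example per step
w_mini = train(batch_size=2)            # mini-batch GD: a small subset per step
```

Because the toy data is noise-free, all three runs end near w = 3. On real data the single-example updates would fluctuate more from step to step than the mini-batch or full-batch updates.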
Explain how the Hebb learning rule works as a supervised learning mechanism.
Hebb’s Postulate
“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in
firing it, some growth process or metabolic change takes place in one or both cells such that A’s
efficiency, as one of the cells firing B, is increased.”
Hebb’s learning law can be used in combination with a variety of neural network architectures.
Figure: Linear Associator
The linear associator is an example of a type of neural network called an associative memory. The task
of an associative memory is to learn Q pairs of prototype input/output vectors:
{p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}
For the supervised Hebb rule we substitute the target output for the actual output. In this way, we are
telling the algorithm what the network should do, rather than what it is currently doing. The resulting
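equation is:
W_new = W_old + t_q p_q^T
where p_q is the q-th prototype input vector and t_q is its target output vector.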
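A minimal sketch of the supervised Hebb rule for a linear associator follows. It assumes the outer-product form W_new = W_old + t p^T with a zero initial weight matrix and the recall a = W p; the two prototype pairs are hypothetical values chosen so that the inputs are orthonormal.

```python
def outer(t, p):
    """Outer product t p^T as a list of rows."""
    return [[t_i * p_j for p_j in p] for t_i in t]

def hebb_train(pairs, n_out, n_in):
    """Apply W_new = W_old + t p^T once for each prototype pair (p, t)."""
    W = [[0.0] * n_in for _ in range(n_out)]
    for p, t in pairs:
        update = outer(t, p)
        W = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, update)]
    return W

def recall(W, p):
    """Linear associator output a = W p."""
    return [sum(w * x for w, x in zip(row, p)) for row in W]

# Two orthonormal prototype inputs with their targets (illustrative values).
pairs = [
    ([0.5, -0.5, 0.5, -0.5], [1.0, -1.0]),
    ([0.5, 0.5, -0.5, -0.5], [1.0, 1.0]),
]
W = hebb_train(pairs, n_out=2, n_in=4)
outputs = [recall(W, p) for p, _ in pairs]
```

Because the prototype inputs here are orthonormal, each recalled output equals its target exactly. With non-orthogonal inputs the recall would contain cross-talk from the other stored pairs.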
How can a neural network be trained to generalize? Discuss the key strategies, and discuss
Early Stopping in detail.
A network trained to generalize will perform as well in new situations as it does on the data on which it
was trained. The key strategy we will use for obtaining good generalization is to find the simplest model
that explains the data. In terms of neural networks, the simplest model is the one that contains the
smallest number of free parameters (weights and biases), or, equivalently, the smallest number of
neurons. To find a network that generalizes well, we need to find the simplest network that fits the data.
There are at least five different approaches that people have used to produce simple networks: growing,
pruning, global searches, regularization, and early stopping. Growing methods start with no neurons in
the network and then add neurons until the performance is adequate. Pruning methods start with large
networks, which likely overfit, and then remove neurons (or weights) one at a time until the
performance degrades significantly. Global searches, such as genetic algorithms, search the space of all
possible network architectures to locate the simplest model that explains the data.
The final two approaches, regularization and early stopping, keep the network small by constraining the
magnitude of the network weights, rather than by constraining the number of network weights. The
discussion below concentrates on these two approaches.
These approaches fit into two general categories: restricting the number of weights (or, equivalently,
the number of neurons) in the network, or restricting the magnitude of the weights.
Early Stopping
The first method for improving generalization is also the simplest: early stopping. The idea behind this
method is that as training progresses the network uses more and more of its weights, until all weights
are fully used when training reaches a minimum of the error surface. Increasing the number of training
iterations therefore increases the complexity of the resulting network. If training is stopped before the
minimum is reached, then the network will effectively be using fewer parameters and will be less likely
to overfit.
In order to use early stopping effectively, we need to know when to stop the training. One method,
called cross-validation, uses a validation set to decide when to stop. The available data (after removing
the test set) is divided into two parts: a training set and a validation set. The training set is used to
compute gradients or Jacobians and to determine the weight update at each iteration. The validation
set is an indicator of what is happening to the network function “in between” the training points, and
its error is monitored during the training process. When the error on the validation set goes up for
several iterations, the training is stopped, and the weights that produced the minimum error on the
validation set are used as the final trained network weights.
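The stopping procedure can be sketched as follows. To keep the example deterministic, the training and validation losses are stand-in functions of a single weight rather than errors measured on real data; their shapes, the patience of 5 iterations, and the learning rate are all assumptions.

```python
def train_loss_grad(w):
    """Gradient of a hypothetical training loss (w - 2.0)**2."""
    return 2 * (w - 2.0)

def validation_loss(w):
    """Hypothetical validation loss, which bottoms out earlier, at w = 1.5."""
    return (w - 1.5) ** 2

def train_with_early_stopping(learning_rate=0.05, patience=5, max_iters=1000):
    w = 0.0
    best_w, best_val = w, validation_loss(w)
    bad_iters = 0                                  # iterations since the last improvement
    for _ in range(max_iters):
        w -= learning_rate * train_loss_grad(w)    # update uses the training set only
        val = validation_loss(w)                   # validation error is only monitored
        if val < best_val:
            best_w, best_val = w, val
            bad_iters = 0
        else:
            bad_iters += 1
            if bad_iters >= patience:              # validation error rose for several iterations
                break
    return best_w, w

best_w, last_w = train_with_early_stopping()
```

The function returns the weight with the lowest validation loss (best_w), not the weight at the moment training stopped (last_w), matching the rule that the final network uses the weights from the validation minimum.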
Illustration of Early Stopping
How does regularization guide the training of a neural network to avoid overfitting?
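The standard performance index for neural network training is the sum squared error on the training
set:
F(x) = E_D = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q)
where a_q is the network output for input p_q and t_q is the corresponding target. We are using the
variable E_D to represent the sum squared error on the training data. To improve generalization, a
penalty (regularization) term is added to this performance index. Under certain conditions, this
regularization term can be written as the sum of squares of the network weights, as in:
F(x) = \beta E_D + \alpha E_W = \beta \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) + \alpha \sum_{i=1}^{n} x_i^2
where the x_i are the n network weights and biases, and the ratio α/β controls the effective complexity
of the network solution. The larger this ratio is, the smoother the network response.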
When the weights are large, the function created by the network can have large slopes, and is therefore
more likely to overfit the training data. If we restrict the weights to be small, then the network function
will create a smooth interpolation through the training data - just as if the network had a small number
of neurons.
There are several techniques for setting the regularization parameter. One approach is to use a
validation set, as in early stopping; the regularization parameter is set to minimize the squared
error on the validation set.
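The shrinking effect of the penalty can be illustrated with a one-weight model a = w * p. The sketch assumes a regularized index of the form F = beta * E_D + alpha * E_W, with E_D the sum squared error and E_W = w**2; the data and the alpha values are hypothetical.

```python
# Toy data (hypothetical): targets are roughly 2 * p.
inputs = [1.0, 2.0, 3.0]
targets = [2.1, 3.9, 6.2]

def fit(alpha, beta=1.0, learning_rate=0.01, steps=2000):
    """Minimize F = beta * E_D + alpha * E_W by gradient descent for a = w * p."""
    w = 0.0
    for _ in range(steps):
        grad_ed = sum(-2 * (t - w * p) * p for p, t in zip(inputs, targets))  # dE_D/dw
        grad_ew = 2 * w                                                       # dE_W/dw
        w -= learning_rate * (beta * grad_ed + alpha * grad_ew)
    return w

w_plain = fit(alpha=0.0)    # no penalty: ordinary least squares fit
w_small = fit(alpha=1.0)    # alpha/beta = 1
w_large = fit(alpha=10.0)   # alpha/beta = 10: the weight is pulled further toward 0
```

As the ratio alpha/beta grows, the fitted weight gets smaller, which for this model means a flatter response. Choosing the ratio would in practice be done on a validation set, which this sketch does not include.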
Effect of Regularization Ratio
MCQ
1. The cost function is minimized by __________
a) Linear regression
b) Polynomial regression
c) PAC learning
d) Gradient descent
2. What happens when the learning rate is low?
a) It always reaches the minima quickly
b) It reaches the minima very slowly
c) It overshoots the minima
d) Nothing happens
3. Which of the following statements is true about the learning rate
alpha in gradient descent?
a) If alpha is very small, gradient descent will be fast to
converge. If alpha is too large, gradient descent will
overshoot
b) If alpha is very small, gradient descent can be slow
to converge. If alpha is too large, gradient descent
will overshoot
c) If alpha is very small, gradient descent can be slow to
converge. If alpha is too large, gradient descent can be slow
too
d) If alpha is very small, gradient descent will be fast to
converge. If alpha is too large, gradient descent will be slow.
a) Dropout
b) Batch normalization
c) Early stopping
d) All of the above
12. Which of the following is a common approach to solving a time series
forecasting problem?
a) ARIMA models
b) Exponential smoothing
c) Recurrent neural networks
d) All of the above
14. In batch method gradient descent, each step requires the entire training set be processed in order to
evaluate the error function.
a) True
b) False
15. Gradient descent is an optimization algorithm for finding the local minimum of a function.
a) True
b) False
18. Which of the following statements is false about choosing learning rate in gradient descent?
a) Small learning rate leads to slow convergence
b) A large learning rate causes the loss function to fluctuate around the minimum
c) A large learning rate can cause divergence
d) A small learning rate causes the training to progress very fast
20. Which of the following statements is not true about the cost function?
a) It is a measure of how good a neural network did with respect to its given training sample and the
expected output
b) It depends on variables such as weights
c) It is a single value, not a vector
d) It never depends on bias
22. Which of the following statements is not true about cost function?
a) Cost function is also called a loss or error function
b) Its goal is to maximize the cost function
c) We want to define a cost function to find the weight in the neural network
d) Different cost functions are needed for regression and classification problems
23. If the weight matrix stores the given patterns, then the network becomes?
a) autoassociative memory
b) heteroassociative memory
c) multidirectional associative memory
d) temporal associative memory
24. If the weight matrix stores multiple associations among several patterns, then network becomes?
a) autoassociative memory
b) heteroassociative memory
c) multidirectional associative memory
d) temporal associative memory
25. If the weight matrix stores association between adjacent pairs of patterns, then network becomes?
a) autoassociative memory
b) heteroassociative memory
c) multidirectional associative memory
d) temporal associative memory
28. Why should gradient checking be turned off, once it has been done, before running the network for
the entire set of training iterations?
a) Because it would increase the speed of the training process
b) Because it would change the output of the training process
c) Because it would slow down the training process
d) Because it would nullify the output