QB Unit 3

Uploaded by

kowshikch2004

Discuss Gradient Descent Optimization.

• Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide
range of problems.

• The general idea is to tweak parameters iteratively in order to minimize the cost function.

• An important parameter of Gradient Descent (GD) is the size of the steps, determined by the learning
rate hyperparameter. If the learning rate is too small, the algorithm needs many iterations to converge,
which takes a long time. If it is too high, the updates may jump past the optimal value and can even
diverge.
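
The effect of the learning rate can be illustrated with a minimal sketch. The cost function E(w) = (w - 3)^2, the starting point and the three learning rates are assumptions chosen only for illustration; they do not come from the question bank.

```python
def gradient_descent(grad, w0, learning_rate, n_steps):
    """Repeatedly step against the gradient; return every visited point."""
    w = w0
    path = [w]
    for _ in range(n_steps):
        w = w - learning_rate * grad(w)
        path.append(w)
    return path


# Hypothetical cost E(w) = (w - 3)^2, whose gradient is dE/dw = 2 * (w - 3).
def grad_E(w):
    return 2.0 * (w - 3.0)


small = gradient_descent(grad_E, w0=0.0, learning_rate=0.01, n_steps=50)
good = gradient_descent(grad_E, w0=0.0, learning_rate=0.3, n_steps=50)
large = gradient_descent(grad_E, w0=0.0, learning_rate=1.1, n_steps=50)

print(abs(small[-1] - 3.0))  # too small: still far from the minimum after 50 steps
print(abs(good[-1] - 3.0))   # moderate: essentially at the minimum
print(abs(large[-1] - 3.0))  # too large: every step overshoots and the error grows
```

For this particular cost, each step multiplies the distance to the minimum by (1 - 2 * learning_rate), which is why a rate above 1.0 makes the distance grow instead of shrink.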

Define Stochastic Gradient Descent (SGD) with advantages and disadvantages.

• In Stochastic Gradient Descent, a single sample (or a small random subset) is selected randomly instead
of the whole dataset for each iteration.

• In Gradient Descent, there is a term called “batch” which denotes the total number of samples from a
dataset that is used for calculating the gradient for each iteration.

• In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the
whole dataset.

Advantages

• Speed: Each SGD update is much cheaper to compute than a Batch Gradient Descent update, because it
uses one sample (or a few) instead of the whole dataset.

• Memory Efficiency: It is memory-efficient and can handle large datasets that cannot fit into memory.

• Avoidance of Local Minima: The noisy updates in SGD can help it escape from local minima, although
convergence to a global minimum is not guaranteed.

Disadvantages

• Noisy updates: The updates in SGD are noisy and have a high variance, which can make the optimization
process less stable and lead to oscillations around the minimum.

• Slow Convergence: SGD may require more iterations to converge to the minimum since it updates the
parameters for each training example one at a time.

• Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD since using a high
learning rate can cause the algorithm to overshoot the minimum, while a low learning rate can make the
algorithm converge slowly.

• Less Accurate: Due to the noisy updates, SGD may not converge to the exact global minimum and can
result in a suboptimal solution. This can be mitigated by using techniques such as learning rate
scheduling and momentum-based updates.

Write the difference between different gradient descent algorithms.

Define Hyperparameter tuning and its strategies.

• A Machine Learning model is defined as a mathematical model with a number of parameters that need
to be learned from the data. By training a model with existing data, we are able to fit the model
parameters.

• However, there is another kind of parameter, known as hyperparameters, that cannot be directly
learned from the regular training process. They are usually fixed before the actual training process
begins. These parameters express important properties of the model such as its complexity or how fast
it should learn.

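GridSearchCV: exhaustively evaluates every combination of values in a user-specified hyperparameter
grid, scoring each combination with cross-validation.

RandomizedSearchCV: evaluates a fixed number of hyperparameter settings sampled at random from
user-specified ranges or distributions, again scoring each with cross-validation.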
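
GridSearchCV and RandomizedSearchCV are scikit-learn classes. The sketch below does not call scikit-learn. It only illustrates, in plain Python, the difference between the two search strategies. The function `validation_error` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation error", and the hyperparameter names and ranges are assumptions made for the example.

```python
import itertools
import random


def validation_error(learning_rate, alpha):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (learning_rate - 0.1) ** 2 + (alpha - 0.01) ** 2


def grid_search(param_grid):
    """Try every combination of the listed values and keep the best one."""
    names = sorted(param_grid)
    best_params, best_err = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        err = validation_error(**params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err


def random_search(param_ranges, n_iter, seed=0):
    """Sample n_iter settings uniformly from (low, high) ranges and keep the best one."""
    rng = random.Random(seed)
    best_params, best_err = None, float("inf")
    for _ in range(n_iter):
        params = {n: rng.uniform(lo, hi) for n, (lo, hi) in param_ranges.items()}
        err = validation_error(**params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err


grid_best, grid_err = grid_search(
    {"learning_rate": [0.001, 0.01, 0.1, 1.0], "alpha": [0.0, 0.01, 0.1]}
)
rand_best, rand_err = random_search(
    {"learning_rate": (0.0, 1.0), "alpha": (0.0, 0.1)}, n_iter=20
)
print(grid_best, grid_err)
print(rand_best, rand_err)
```

Grid search costs one evaluation per combination (12 here), so its cost grows quickly with the number of hyperparameters, whereas random search keeps the budget fixed at `n_iter` evaluations.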

Explain the principle of the gradient descent algorithm. Accompany your explanation with a diagram.

Solution: Training can be posed as an optimization problem, in which the goal is to optimize a function
(usually to minimize a cost function E) with respect to a number of free variables, usually the weights
$w_i$. The gradient descent algorithm begins from an initialization of the weights (e.g. a random
initialization) and, in an iterative procedure, updates each weight $w_i$ by a quantity $\Delta w_i$,
where $\Delta w_i = -\alpha \, \partial E / \partial w_i$ and $\partial E / \partial w_i$ is the gradient
of the cost function with respect to that weight, while $\alpha$ is a constant (the learning rate) which
takes small values in order to keep the updates small and avoid oscillations.
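
As a small worked illustration of one update step, assume a single weight with cost $E(w) = (w - 3)^2$, learning rate $\alpha = 0.1$ and starting value $w = 0$ (these numbers are chosen for the example, not taken from the question):

```latex
\begin{aligned}
E(w) &= (w - 3)^2, \qquad \frac{\partial E}{\partial w} = 2(w - 3) \\
\Delta w &= -\alpha \frac{\partial E}{\partial w} = -0.1 \cdot 2(0 - 3) = 0.6 \\
w_{\text{new}} &= w + \Delta w = 0 + 0.6 = 0.6, \qquad
E(w_{\text{new}}) = (0.6 - 3)^2 = 5.76 < E(0) = 9
\end{aligned}
```

The weight moves toward the minimum at $w = 3$ and the cost decreases, which is what each iteration of the algorithm is meant to achieve.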

16 Marks
Discuss in detail how the network is trained.

Every neuron in a given layer produces an output, but these outputs do not carry the same weight into
the next layer. A pattern detected by one neuron may matter little for the overall picture, in which case
its output is partially or completely muted. This is called weighting.

A large weight means that the input is important, and a small weight means that it should largely be
ignored. Every connection between neurons has an associated weight.

The weights are adjusted during training to fit the objectives we have set (for example, recognizing
that a dog is a dog and that a cat is a cat).

• In simple terms: training a neural network means finding appropriate weights for the neural
connections by means of a feedback loop called backpropagation (backward propagation of gradients).

Steps to Train an Artificial Neural Network

1. Initialize the weights of the ANN randomly

2. Split the dataset into batches (batch size)

3. Send the batches one by one to the GPU

4. Calculate the forward pass (what would be the output with the current weights)

5. Compare the calculated output to the expected output (loss)

6. Adjust the weights according to the backward pass (backward gradient propagation), increasing or
decreasing each weight by an amount scaled by the learning rate

7. Repeat from step 3 for the next batch; after the last batch, go back to step 2 for another pass over
the dataset
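
The steps above can be sketched for the smallest possible "network", a single linear neuron y = w * x + b trained with a mean squared error loss. This is only an illustration under stated assumptions: the model, the toy dataset and the hyperparameter values are hypothetical, everything runs on the CPU (no GPU transfer), and the backward pass is written out by hand instead of using a deep learning library.

```python
import random


def train(data, batch_size=2, learning_rate=0.05, epochs=500, seed=0):
    """Train a single linear neuron y = w * x + b with mini-batch updates."""
    rng = random.Random(seed)
    # Step 1: random weight initialization.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(epochs):
        # Step 2: split the (shuffled) dataset into batches.
        samples = data[:]
        rng.shuffle(samples)
        batches = [samples[i:i + batch_size]
                   for i in range(0, len(samples), batch_size)]
        # Step 3: process the batches one by one (on the CPU here).
        for batch in batches:
            grad_w = grad_b = 0.0
            for x, target in batch:
                output = w * x + b       # Step 4: forward pass.
                error = output - target  # Step 5: compare with the expected output.
                # Step 6a: backward pass, gradient of the mean squared error.
                grad_w += 2 * error * x / len(batch)
                grad_b += 2 * error / len(batch)
            # Step 6b: adjust the weights, scaled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        # Step 7: loop back for another pass over the dataset.
    return w, b


# Toy dataset generated from y = 2x + 1 (hypothetical).
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0]]
w, b = train(data)
print(w, b)  # should end up close to w = 2 and b = 1
```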

Discuss in detail the Gradient Descent optimization algorithm.

Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide
range of problems.

The general idea is to tweak parameters iteratively in order to minimize the cost function.

An important parameter of Gradient Descent (GD) is the size of the steps, determined by the learning
rate hyperparameter. If the learning rate is too small, the algorithm needs many iterations to converge,
which takes a long time. If it is too high, the updates may jump past the optimal value and can even
diverge.

Types of Gradient Descent:


Typically, there are three types of Gradient Descent:

1. Batch Gradient Descent

Batch Gradient Descent computes the gradient over the full training set at each step. This makes every
step computationally expensive, so the method is very slow on very large training sets.

2. Stochastic Gradient Descent

In SGD, only one training example is used to compute the gradient and update the parameters at each
iteration. This can be faster than batch gradient descent but may lead to more noise in the updates.

3. Mini-batch Gradient Descent

In mini-batch gradient descent, a small batch of training examples is used to compute the gradient and
update the parameters at each iteration. This can be a good compromise between batch gradient
descent and SGD, as it can be faster than batch gradient descent and less noisy than SGD.
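
The three variants differ only in how many samples are used per update, which can be sketched with one function and a `batch_size` argument. The model y = w * x, the toy data and the learning rate are assumptions made for the example.

```python
import random


def run_epoch(data, w, batch_size, learning_rate, rng):
    """One pass over the data for the model y = w * x; returns (w, n_updates)."""
    samples = data[:]
    rng.shuffle(samples)
    n_updates = 0
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        # Gradient of the mean squared error over this batch only.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * grad
        n_updates += 1
    return w, n_updates


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples, true w = 3
rng = random.Random(0)
results = {}
for name, batch_size in [("batch", len(data)), ("stochastic", 1), ("mini-batch", 4)]:
    w, updates = run_epoch(data, 0.0, batch_size, 0.01, rng)
    results[name] = (updates, w)
    print(name, updates, w)
# Updates per epoch with 8 samples: batch 1, stochastic 8, mini-batch 2.
```

With the same pass over the data, the batch variant makes a single, exact update, the stochastic variant makes eight cheap but noisy updates, and the mini-batch variant sits in between.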

Explain how the Hebb learning rule works in a supervised learning mechanism.

Hebb’s Postulate

“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in
firing it, some growth process or metabolic change takes place in one or both cells such that A’s
efficiency, as one of the cells firing B, is increased.”

Hebb’s learning law can be used in combination with a variety of neural network architectures.

Figure: Linear Associator

The linear associator is an example of a type of neural network called an associative memory. The task
of an associative memory is to learn Q pairs of prototype input/output vectors:

$$\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$$

The Hebb Rule

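For the supervised Hebb rule we substitute the target output for the actual output. In this way, we are
telling the algorithm what the network should do, rather than what it is currently doing. The resulting
equation is:

$$W^{new} = W^{old} + t_q p_q^T$$

where $p_q$ is the q-th prototype input vector and $t_q$ is the corresponding target output vector.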
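
A minimal sketch of the supervised Hebb rule for a linear associator, assuming the usual form of the rule, W_new = W_old + t p^T, applied once per prototype pair starting from a zero weight matrix. The two prototype pairs are hypothetical, and the inputs are chosen to be orthonormal, which is the case in which recall is exact.

```python
def outer(t, p):
    """Outer product t p^T as a list of rows."""
    return [[ti * pj for pj in p] for ti in t]


def hebb_train(pairs, n_out, n_in):
    """Supervised Hebb rule: start from W = 0 and add t p^T for each pair."""
    W = [[0.0] * n_in for _ in range(n_out)]
    for p, t in pairs:
        update = outer(t, p)
        W = [[W[i][j] + update[i][j] for j in range(n_in)] for i in range(n_out)]
    return W


def recall(W, p):
    """Linear associator output a = W p."""
    return [sum(wij * pj for wij, pj in zip(row, p)) for row in W]


# Two orthonormal prototype inputs with their (hypothetical) targets.
pairs = [
    ([0.5, -0.5, 0.5, -0.5], [1.0, -1.0]),
    ([0.5, 0.5, -0.5, -0.5], [1.0, 1.0]),
]
W = hebb_train(pairs, n_out=2, n_in=4)
print(recall(W, pairs[0][0]))  # [1.0, -1.0], the first target
print(recall(W, pairs[1][0]))  # [1.0, 1.0], the second target
```

If the prototype inputs are not orthogonal, the recalled output contains cross-talk from the other stored pairs and no longer matches the target exactly.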

Explain how a neural network can be trained to generalize and discuss the key strategies. Discuss
Early Stopping in detail.
A network trained to generalize will perform as well in new situations as it does on the data on which it
was trained. The key strategy we will use for obtaining good generalization is to find the simplest model
that explains the data. In terms of neural networks, the simplest model is the one that contains the
smallest number of free parameters (weights and biases), or, equivalently, the smallest number of
neurons. To find a network that generalizes well, we need to find the simplest network that fits the data.

There are at least five different approaches that people have used to produce simple networks: growing,
pruning, global searches, regularization, and early stopping. Growing methods start with no neurons in
the network and then add neurons until the performance is adequate. Pruning methods start with large
networks, which likely overfit, and then remove neurons (or weights) one at a time until the
performance degrades significantly. Global searches, such as genetic algorithms, search the space of all
possible network architectures to locate the simplest model that explains the data.

The final two approaches, regularization and early stopping, keep the network small by constraining the
magnitude of the network weights, rather than by constraining the number of network weights. In this
chapter we will concentrate on these two approaches.

Methods for Improving Generalization

These approaches fit into two general categories: restricting the number of weights (or, equivalently,
the number of neurons) in the network, or restricting the magnitude of the weights.

Early Stopping

The first method we will discuss for improving generalization is also the simplest method. It is called
early stopping. The idea behind this method is that as training progresses the network uses more and
more of its weights, until all weights are fully used when training reaches a minimum of the error
surface. By increasing the number of iterations of training, we are increasing the complexity of the
resulting network: the effective number of parameters grows as the number of iterations increases. If
training is stopped before the minimum is reached, then the network will effectively be using fewer
parameters and will be less likely to overfit.

In order to use early stopping effectively, we need to know when to stop the training. We will describe
a method, called cross-validation, that uses a validation set to decide when to stop. The available data
(after setting aside the test set) is divided into two parts: a training set and a validation set. The
training set is used to compute gradients or Jacobians and to determine the weight update at each
iteration. The validation set is an indicator of what is happening to the network function “in between”
the training points, and its error is monitored during the training process. When the error on the
validation set goes up for several iterations, the training is stopped, and the weights that produced
the minimum error on the validation set are used as the final trained network weights.

Figure: Illustration of Early Stopping
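
The stopping criterion described above can be sketched as a small loop. Everything here is an assumption made for illustration: the `patience` parameter (how many consecutive non-improving iterations to tolerate), the one-weight toy "network" whose training error is minimized at w = 2.0, and the validation error that is minimized at w = 1.5.

```python
def train_with_early_stopping(train_step, validation_error, max_iters, patience):
    """Run train_step until the validation error has failed to improve for
    `patience` consecutive iterations; return the best weights seen so far."""
    best_error = float("inf")
    best_weights = None
    bad_iters = 0
    for _ in range(max_iters):
        weights = train_step()
        error = validation_error(weights)
        if error < best_error:
            # New minimum on the validation set: remember these weights.
            # (For mutable weight containers, store a copy instead.)
            best_error, best_weights, bad_iters = error, weights, 0
        else:
            bad_iters += 1
            if bad_iters >= patience:
                break
    return best_weights, best_error


# Toy setup (hypothetical): a gradient step on the training error (w - 2)^2.
state = {"w": 0.0}


def train_step():
    state["w"] -= 0.1 * 2 * (state["w"] - 2.0)
    return state["w"]


def validation_error(w):
    return (w - 1.5) ** 2


best_w, best_err = train_with_early_stopping(
    train_step, validation_error, max_iters=100, patience=3
)
print(best_w, best_err)  # weights with the lowest validation error
print(state["w"])        # weights at the moment training was stopped
```

The returned weights are the ones from the iteration with the lowest validation error, not the last ones computed before training stopped.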

How does regularization guide the training of a neural network to avoid overfitting?

The standard performance index for neural network training is the sum squared error on the training
set:

$$F = E_D = \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q)$$

where $a_q$ is the network output for input $p_q$ and $t_q$ is the corresponding target. We use the
variable $E_D$ to represent the sum squared error on the training data. Regularization adds to this
index a term $E_W$ that penalizes network complexity. Under certain conditions, this regularization term
can be written as the sum of squares of the network weights, as in:

$$F = \beta E_D + \alpha E_W = \beta \sum_{q=1}^{Q} (t_q - a_q)^T (t_q - a_q) + \alpha \sum_{i=1}^{n} x_i^2$$

where $x_i$ are the network weights and biases, and the ratio $\alpha/\beta$ controls the effective
complexity of the network solution. The larger this ratio is, the smoother the network response.

When the weights are large, the function created by the network can have large slopes, and is therefore
more likely to overfit the training data. If we restrict the weights to be small, then the network
function will create a smooth interpolation through the training data, just as if the network had a
small number of neurons.

Figure: Effect of Weight on Network Response

There are several techniques for setting the regularization parameter. One approach is to use a
validation set, as in early stopping: the regularization parameter is set to minimize the squared
error on the validation set.

Figure: Effect of Regularization Ratio
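
The effect of the ratio alpha/beta on the weights can be sketched for the simplest case, a model a = w * p with a single weight. Assuming the regularized index is the sum squared error plus (alpha/beta) times w^2, setting its derivative to zero gives the closed form used below. The data points and the ratios are hypothetical.

```python
def fit_weight(data, ratio):
    """Minimize sum((t - w * p)^2) + ratio * w^2 for a single weight w.

    Setting the derivative with respect to w to zero gives
    w = sum(p * t) / (sum(p * p) + ratio).
    """
    sum_pt = sum(p * t for p, t in data)
    sum_pp = sum(p * p for p, t in data)
    return sum_pt / (sum_pp + ratio)


data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # hypothetical (input, target) pairs
ratios = [0.0, 1.0, 10.0, 100.0]
weights = [fit_weight(data, r) for r in ratios]
for r, w in zip(ratios, weights):
    print(r, w)  # the fitted weight shrinks as the ratio grows
```

A ratio of zero reproduces the ordinary least-squares fit, while larger ratios pull the weight toward zero, which is the one-weight analogue of a smoother network response.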

MCQ
1. The cost function is minimized by __________
a) Linear regression
b) Polynomial regression
c) PAC learning
d) Gradient descent
2. What happens when the learning rate is low?
a) It always reaches the minima quickly
b) It reaches the minima very slowly
c) It overshoots the minima
d) Nothing happens
3. Which of the following statements is true about the learning rate alpha in gradient descent?
a) If alpha is very small, gradient descent will be fast to converge. If alpha is too large, gradient
descent will overshoot
b) If alpha is very small, gradient descent can be slow to converge. If alpha is too large, gradient
descent will overshoot
c) If alpha is very small, gradient descent can be slow to converge. If alpha is too large, gradient
descent can be slow too
d) If alpha is very small, gradient descent will be fast to converge. If alpha is too large, gradient
descent will be slow.

4. Suppose you have a neural network that is overfitting to the training data. Which of the following
can fix the situation?
a) Regularization
b) Decrease model complexity
c) Train less/early stopping
d) All of the above

4. What is the risk with tuning hyper-parameters using a test dataset?


a) Model will overfit the test set
b) Model will underfit the test set
c) Model will overfit the training set
d) Model will perform balanced

5. What is Hebbian learning?
a) synaptic strength is proportional to correlation between firing of post & presynaptic neuron
b) synaptic strength is proportional to correlation between firing of postsynaptic neuron only
c) synaptic strength is proportional to correlation between firing of presynaptic neuron only
d) none of the mentioned
6. What is differential Hebbian learning?
a) synaptic strength is proportional to correlation between firing of post & presynaptic neuron
b) synaptic strength is proportional to correlation between firing of postsynaptic neuron only
c) synaptic strength is proportional to correlation between firing of presynaptic neuron only
d) synaptic strength is proportional to changes in correlation between firing of post &
presynaptic neuron
7. What is the objective of backpropagation algorithm?
a) to develop learning algorithm for multilayer feedforward neural network
b) to develop learning algorithm for single layer feedforward neural network
c) to develop learning algorithm for multilayer feedforward neural network, so that network
can be trained to capture the mapping implicitly
d) None of the above.
8. What is meant by “generalized” in the statement “backpropagation is a generalized delta rule”?
a) because delta rule can be extended to hidden layer units
b) because delta is applied to only input and output layers, thus making it more simple and
generalized
c) it has no significance
d) None of the above.

9. What is the purpose of regularization in machine learning?

a) To reduce the number of features in a model


b) To prevent overfitting and improve generalization
c) To speed up the training process
d) To increase the accuracy of the model

10. What is the purpose of cross-validation in machine learning?

a) To evaluate the performance of a model on a held-out test set


b) To evaluate the performance of a model on different subsets of the data
c) To compare the performance of different models
d) To tune the hyperparameters of a model

11. Which of the following is a common approach to reducing overfitting?

a) Dropout
b) Batch normalization
c) Early stopping
d) All of the above
12. Which of the following is a common approach to solving a time series
forecasting problem?

a) ARIMA models
b) Exponential smoothing
c) Recurrent neural networks
d) All of the above

13. Which of the following statements is false about gradient descent?


a) It updates the weight to comprise a small step in the direction of the negative gradient
b) The learning rate parameter is η where η > 0
c) In each iteration, the gradient is re-evaluated for the new weight vector
d) In each iteration, the weight is updated in the direction of positive gradient

14. In batch method gradient descent, each step requires the entire training set be processed in order to
evaluate the error function.
a) True
b) False

15. Gradient descent is an optimization algorithm for finding the local minimum of a function.
a) True
b) False

16. Which of the following statements is false about gradient descent?


a) It updates the weight to comprise a small step in the direction of the negative gradient
b) The learning rate parameter is η where η > 0
c) In each iteration, the gradient is re-evaluated for the new weight vector
d) In each iteration, the weight is updated in the direction of positive gradient

17. In batch method gradient descent, each step requires the entire training set be processed in order to
evaluate the error function.
a) True
b) False

18. Which of the following statements is false about choosing learning rate in gradient descent?
a) Small learning rate leads to slow convergence
b) A large learning rate causes the loss function to fluctuate around the minimum
c) A large learning rate can cause divergence
d) A small learning rate causes the training to progress very fast

19. Which of the following is not related to a gradient descent?


a) AdaBoost
b) Adadelta
c) Adagrad
d) RMSprop

20. Which of the following statements is not true about the cost function?
a) It is a measure of how well a neural network did with respect to its given training sample and the
expected output
b) It depends on variables such as weights
c) It is a single value, not a vector
d) It never depends on bias

21. Which of the following is not a cost function requirement?


a) The cost function must be able to be written as an average
b) The cost function must not be dependent on any activation values of a neural network
c) Technically a cost function can be dependent on any output values
d) If the cost function is dependent on other activation layers besides the output one, back
propagation will be valid

22. Which of the following statements is not true about cost function?
a) Cost function is also called a loss or error function
b) Its goal is to maximize the cost function
c) We want to define a cost function to find the weights in the neural network
d) Different cost functions are needed for regression and classification problems

23. If the weight matrix stores the given patterns, then the network becomes?
a) autoassociative memory
b) heteroassociative memory
c) multidirectional associative memory
d) temporal associative memory

24. If the weight matrix stores multiple associations among several patterns, then the network becomes?
a) autoassociative memory
b) heteroassociative memory
c) multidirectional associative memory
d) temporal associative memory

25. If the weight matrix stores association between adjacent pairs of patterns, then the network becomes?
a) autoassociative memory
b) heteroassociative memory
c) multidirectional associative memory
d) temporal associative memory

26. What are some of the desirable characteristics of associative memories?
a) ability to store a large number of patterns
b) fault tolerance
c) ability to recall even when the input pattern is noisy
d) All of the mentioned
27. Which of the following statements is incorrect about backpropagation?
a) It is an algorithm commonly used to train the neural networks
b) It helps to adjust the weights of the neurons so that the accuracy of the output increases
c) It is a method of training the neural networks to perform tasks more accurately
d) The idea behind backpropagation is not to test how wrong the neural network is

28. Why should gradient checking be turned off, once it is done, before running the network for the
entire set of training iterations?
a) Because it would increase the speed of training process
b) Because it would change the output of the training process
c) Because it would slow down the speed of training process
d) Because it would nullify the output
