Unit 1 Question and Answers

Deep learning is a form of machine learning that uses artificial neural networks inspired by the human brain. These neural networks consist of interconnected nodes organized into layers that can automatically learn representations of data. The major deep learning architectures are feedforward neural networks, convolutional neural networks, and recurrent neural networks. A feedforward neural network passes information from the input to output layers without loops. It contains an input layer, one or more hidden layers, and an output layer. A multilayer perceptron is a type of feedforward network with an input, hidden, and output layer where neurons in each layer are fully connected.

➢ What is Deep Learning? List the major architectures of deep networks

➢ Deep Learning is a subset of machine learning that focuses on training
artificial neural networks to perform tasks that typically require human
intelligence. It involves learning representations of data through
multiple layers of interconnected nodes (neurons) inspired by the
structure and function of the human brain. Deep Learning has gained
significant attention and success in various fields, including computer
vision, natural language processing, speech recognition, and more.
➢ Deep Learning models automatically learn features from data,
eliminating the need for manual feature engineering. The architectures
of deep networks are designed to handle complex patterns and
relationships within data, making them capable of achieving
state-of-the-art performance on tasks like image classification, object
detection, machine translation, and more.

Key Components of Deep Learning:

■ Neural Networks: At the core of deep learning are artificial neural
networks (ANNs). These networks consist of interconnected
nodes, or neurons, organized into layers. Neurons in one layer
are connected to neurons in the adjacent layers through
weighted connections. Each connection has an associated
weight that adjusts during training.
■ Layers:
➔ Input Layer: This is where data is fed into the network.
Each neuron in the input layer corresponds to a feature or
attribute of the input data.
➔ Hidden Layers: These are intermediate layers between
the input and output layers. Each hidden layer processes
and transforms the data through a combination of
weighted sums and activation functions.
➔ Output Layer: The final layer of the network produces the
model's predictions or classifications.
■ Activation Functions: Neurons apply activation functions to the
weighted sum of their inputs to introduce non-linearity to the
model. Common activation functions include ReLU (Rectified
Linear Unit), sigmoid, and tanh. Activation functions enable
neural networks to learn complex relationships within data.
■ Learning Algorithms: Deep learning models learn by adjusting
the weights of connections between neurons to minimize a
defined loss function. This process involves optimization
algorithms like gradient descent, which iteratively updates the
weights in the direction that reduces the loss.
■ Backpropagation: Backpropagation is a fundamental technique
used to train deep learning models. It involves computing
gradients of the loss function with respect to the model's
weights and using these gradients to update the weights in a
way that minimizes the loss. This process is responsible for
making the network learn and improve its performance over
time.

Major Architectures of Deep Networks:

➔ Feedforward Neural Networks (FNN): This is the simplest form
of a deep neural network. Its fully connected variant is known as the
multilayer perceptron (MLP). It consists of an input layer, one or more
hidden layers,
and an output layer. Each neuron in one layer is connected to
every neuron in the subsequent layer. FNNs are used for tasks
like regression and classification.
➔ Convolutional Neural Networks (CNN): CNNs are primarily used
for image analysis and computer vision tasks. They utilize
convolutional layers to automatically learn spatial hierarchies of
features from images. These networks are well-suited for tasks
such as image classification, object detection, and image
segmentation.
➔ Recurrent Neural Networks (RNN): RNNs are designed to handle
sequential data, such as time series and natural language. They
have loops that allow information to persist across different
time steps, making them capable of capturing temporal
dependencies. Long Short-Term Memory (LSTM) and Gated
Recurrent Unit (GRU) are variations of RNNs that help alleviate
the vanishing gradient problem.

➢ Explain the structure of Feed Forward Neural Network with diagram.

A Feed Forward Neural Network consists of multiple layers of interconnected
neurons. Information passes from the input layer to the output layer in a
"feed forward" manner: the data flows in one direction, through the hidden
layers, with no loops, and ultimately produces an output. The FNN can be
divided into three main types of layers: the input layer, hidden layers, and
the output layer.
Input Layer:
● The input layer is the starting point of the network and
represents the raw features or attributes of the input data.
● Each neuron in the input layer corresponds to a feature of the
input data. For example, in image analysis, each pixel's intensity
could be a feature.
● The number of neurons in the input layer is determined by the
dimensionality of the input data.
Hidden Layers:
● Hidden layers are intermediate layers between the input and
output layers.
● Each hidden layer contains a certain number of neurons (also
called units), which learn to capture complex patterns and
relationships in the data.
● The number of hidden layers and the number of neurons in each
layer are design choices that can impact the network's capacity
to learn and its computational efficiency.
● Neurons in hidden layers apply weighted sums of their inputs,
followed by an activation function, to produce their outputs.
These outputs are then passed as inputs to the next layer.
Output Layer:
● The output layer produces the final results of the network's
computation.
● The number of neurons in the output layer is determined by the
nature of the task. For example, in binary classification, a single
neuron can output the probability of the positive class (alternatively,
two neurons can be used, one per class). In multi-class classification,
there would be as many neurons as there are classes.
● The activation function used in the output layer depends on the
task. For binary classification, a sigmoid function is often used,
while for multi-class classification, a softmax function is
common.
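The layer-by-layer flow described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the helper names (`dense`, `forward`), the layer sizes, and all weight values are assumptions chosen for the example, with ReLU assumed in the hidden layer and softmax in the output layer.

```python
import math

def relu(z):
    # ReLU activation: zero for negative inputs, identity otherwise.
    return max(0.0, z)

def softmax(scores):
    # Subtract the largest score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    # weights[j][i] is the weight from input i to neuron j.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    # Input layer -> hidden layer (weighted sum + ReLU) -> output layer (softmax).
    hidden = [relu(z) for z in dense(x, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))

# Hypothetical network: 2 input features, 3 hidden neurons, 2 output classes.
x = [1.0, 2.0]
hidden_w = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]
hidden_b = [0.0, -1.0, 0.0]
out_w = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
out_b = [0.0, 0.0]

probs = forward(x, hidden_w, hidden_b, out_w, out_b)
print(probs)  # two class probabilities that sum to 1
```

The number of rows in each weight matrix equals the number of neurons in that layer, which is how the "design choices" about layer width show up in practice.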
(notes and book)
➢ Explain multilayer perceptron with diagram.

A Multilayer Perceptron (MLP) is a type of feedforward neural network
architecture that consists of an input layer, one or more hidden layers, and an
output layer. Each layer contains interconnected neurons, which are the basic
processing units that compute weighted sums of their inputs and apply an
activation function to produce an output. MLPs are widely used for various
machine learning tasks, including regression and classification.

Structure and Working of MLP:

Input Layer:
● The input layer serves as the initial point of the network and
represents the input data features.
● Each neuron in the input layer corresponds to a specific feature
of the input data. For instance, in image analysis, each pixel's
intensity could be a feature.
● The number of neurons in the input layer is determined by the
dimensionality of the input data.
Hidden Layers:
● Hidden layers are intermediary layers between the input and
output layers. They allow the network to capture complex
relationships within the data.
● Neurons in the hidden layers process the information received
from the previous layer and apply weighted sums of inputs.
● Each neuron's output is then passed through an activation
function. Common activation functions include the Rectified
Linear Unit (ReLU) and the hyperbolic tangent (tanh) function.
● The number of hidden layers and the number of neurons in each
layer are hyperparameters that can be adjusted based on the
complexity of the problem and the available data.
Output Layer:
● The output layer produces the final results of the network's
computations.
● The number of neurons in the output layer depends on the
specific task. For example, in binary classification, a single
neuron can output the probability of the positive class (alternatively,
two neurons can be used, one per class). In multi-class
classification, there would be a neuron for each class.
● The activation function used in the output layer depends on the
nature of the task. For binary classification, a sigmoid function is
commonly used. For multi-class classification, a softmax
function is often employed.

In the diagram, each layer of the Multilayer Perceptron is visually
represented along with the interconnections between neurons. Here's a
breakdown of the diagram:

● Input Layer: The input layer consists of neurons representing the
input features. In the diagram, each box at the top represents a
feature, and each arrow going from the input feature to the
neurons in the hidden layer indicates the input provided by that
feature to each neuron.
● Hidden Layers: The hidden layers are depicted in the middle
section of the diagram. Neurons in the hidden layers take inputs
from the previous layer's neurons (including the input layer) and
process them to produce an output. The connections between
neurons are shown as arrows, and each arrow has a weight
associated with it.
● Output Layer: The output layer is shown at the bottom of the
diagram. Neurons in the output layer take inputs from the last
hidden layer and generate the final output of the network. The
arrows leading from the hidden layer neurons to the output layer
neurons indicate the weighted inputs to each output neuron.
➢ What are the components of a Neural Network? Explain with diagram
A neural network consists of several key components that work
together to process input data, learn from it, and produce desired outputs.
These components include the input layer, hidden layers, weights, biases,
activation functions, and the output layer.

Components of a Neural Network:

Input Layer:
● The input layer is the initial part of the neural network where raw
data is fed into the system.
● Each neuron in the input layer corresponds to a feature or
attribute of the input data. For example, in an image
classification task, each pixel's intensity could be a feature.
● The number of neurons in the input layer is determined by the
dimensionality of the input data. If you have N features, you'll
have N neurons in the input layer.
Hidden Layers:
● Hidden layers are situated between the input and output layers.
They perform the computational work of transforming the input
data into meaningful representations.
● Neurons within hidden layers apply weighted sums of their
inputs from the previous layer and then pass the result through
an activation function.
● Each hidden layer captures progressively abstract features as
you move deeper into the network. The number of hidden layers
and neurons per layer are design choices that impact the
network's capacity to learn complex patterns.
Weights and Biases:
● Weights represent the strengths of connections between
neurons in different layers. Each connection between neurons
has an associated weight.
● Biases are added to the weighted sum before passing through
the activation function. They allow the network to shift the
output values.
● Both weights and biases are learned during the training process,
aiming to optimize the network's performance on the given task.
Activation Functions:
● Activation functions introduce non-linearity to the neural
network. Without them, the network would behave like a linear
model, unable to capture complex relationships in the data.
● Common activation functions include:
● ReLU (Rectified Linear Unit): Applies zero for negative
inputs and the input itself for positive inputs.
● Sigmoid: Squeezes inputs into a range between 0 and 1.
● Tanh (Hyperbolic Tangent): Squeezes inputs into a range
between -1 and 1.
● Softmax: Used in the output layer of multi-class
classification to convert raw scores into probabilities.
Output Layer:
● The output layer produces the final results of the neural
network's computation.
● The number of neurons in the output layer depends on the task.
For example, for binary classification, a single neuron can output
the probability of the positive class (alternatively, there can be one
neuron for each class). For multi-class classification, there would
be a neuron for each class.
● The activation function in the output layer varies depending on
the task. For instance, in binary classification, a sigmoid
function is typically used. In multi-class classification, a softmax
function is commonly applied.

Detailed Diagram Explanation:

In the labeled diagram provided earlier, each component of the neural network
is visually represented:

➔ The input layer consists of neurons representing individual input
features.
➔ Hidden layers, denoted as "Hidden Neuron," process and transform
data through weighted connections and activation functions.
➔ Arrows between neurons represent weighted connections. Each arrow
has a corresponding weight.
➔ Bias terms are omitted in the diagram for simplicity.
➔ The output layer generates final predictions or classifications.

How do artificial neural networks work?

An Artificial Neural Network can be best represented as a weighted directed
graph, where the artificial neurons form the nodes. The associations between
neuron outputs and neuron inputs can be viewed as the directed edges
with weights. The Artificial Neural Network receives the input signal from an
external source, such as a pattern or an image, in the form of a vector.
These inputs are denoted mathematically by x(n), one for each of the
n inputs.

Afterward, each input is multiplied by its corresponding weight (these
weights are the details utilized by the artificial neural network to solve a
specific problem). In general terms, the weights represent the
strength of the interconnections between neurons inside the artificial neural
network. All the weighted inputs are summed inside the computing unit.

A bias is then added to the weighted sum. It shifts the result, so the output
can be non-zero even when the weighted sum is zero. The bias can be viewed as
an extra weight attached to a constant input of 1. The total of the weighted
inputs can be any real number, from negative to positive infinity. To keep the
response within the limits of the desired value, a certain threshold value is
set, and the total of weighted inputs is passed through the activation function.

The activation function refers to the set of transfer functions used to achieve
the desired output. There are different kinds of activation functions, but
they are primarily either linear or non-linear.
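The computation of a single artificial neuron described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `neuron_output`, the input values, and the choice of a sigmoid activation are all made up for the example. The second call shows the view of the bias as an extra weight on a constant input of 1.

```python
import math

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias ...
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # ... passed through a non-linear activation (sigmoid assumed here).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs x(1)..x(3) with their weights and a bias.
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.5

y = neuron_output(x, w, b)

# Equivalent formulation: append a constant input of 1 whose weight is the bias.
y_bias_as_weight = neuron_output(x + [1.0], w + [b], 0.0)

print(y, y_bias_as_weight)
```

Both calls give the same result, because adding `b` to the sum is the same as adding `b * 1`.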
➢ Explain the following terms denoting their notations and equations
(where necessary) with respect to deep neural networks: (Any 5)
○ Connection weights and Biases
○ Epoch
○ Layers and Parameters
○ Activation Functions
○ Loss/Cost Functions
○ Learning rate
○ Sample and batch
○ Hyperparameters

1. Connection weights and Biases

In a deep neural network, connection weights and biases are the
parameters that are learned during the training process. They control how the
network learns to map from its inputs to its outputs.
● Connection weights are the real numbers that are associated with each
connection between two neurons. They represent the strength of the
connection between the two neurons. A higher weight means that the
connection is stronger and will have a greater impact on the output of
the neuron.
● Biases are real numbers that are added to the weighted sum of each
neuron's inputs before it is passed to the activation function. They can
be thought of as a way of shifting the neuron's output so that it is more
accurate.

● The weights and biases are updated during the training process using a
gradient descent algorithm. The goal of the gradient descent algorithm is to
find the weights and biases that minimize the error between the network's
predictions and the desired outputs.
Connection weights and biases are essential for the learning process in deep
neural networks. They allow the network to learn the relationships between its
inputs and outputs. By adjusting the weights and biases, the network can
learn to make more accurate predictions.

Here are some additional things to keep in mind about connection weights
and biases:

● The weights and biases are initialized randomly at the beginning of the
training process.
● The weights and biases are updated during the training process using
a gradient descent algorithm.
● The weights are typically represented as matrices and the biases as vectors.
● The number of weights and biases in a neural network can be very
large.
● The weights and biases are important for the performance of the neural
network.
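
Since the question asks for notation and equations, one common way to write the role of weights and biases is given below. The symbols are assumptions for illustration, not notation taken from these notes: W^(l) and b^(l) are the weight matrix and bias vector of layer l, a^(l-1) is the output of the previous layer, f is the activation function, L is the loss, and α is the learning rate.

```latex
\begin{align}
  z^{(l)} &= W^{(l)} a^{(l-1)} + b^{(l)}, &
  a^{(l)} &= f\left(z^{(l)}\right) \\
  W^{(l)} &\leftarrow W^{(l)} - \alpha \frac{\partial L}{\partial W^{(l)}}, &
  b^{(l)} &\leftarrow b^{(l)} - \alpha \frac{\partial L}{\partial b^{(l)}}
\end{align}
```

The first line is the forward computation of one layer; the second line is the gradient descent update applied to that layer's parameters.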

2. Epoch

3. Layers and Parameters

4. Activation Functions
5. Loss/Cost Functions

Loss functions measure the discrepancy between the predicted values of the
model and the actual ground truth values. The goal during training is to
minimize the value of the loss function, which essentially means making the
network's predictions as close to the actual values as possible.

Notation:

● A generic loss function is often denoted as L or loss.

● For a specific type of loss function, such as mean squared error, the
notation could be MSE(y_true, y_pred), where y_true represents the
true (ground truth) values and y_pred represents the predicted values.
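
As a rough illustration of the notation above, mean squared error can be written as MSE = (1/n) Σ (y_true_i − y_pred_i)². The sketch below implements it in plain Python, together with binary cross-entropy as a second example. Binary cross-entropy is not discussed in these notes and is included only as an assumption about a commonly used classification loss; the `eps` clipping value is likewise an illustrative choice.

```python
import math

def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions.
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_pred holds predicted probabilities of the positive class.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

loss_mse = mse([1.0, 0.0, 2.0], [0.5, 0.0, 3.0])
loss_bce = binary_cross_entropy([1, 0], [0.9, 0.1])
print(loss_mse, loss_bce)
```

In both cases a smaller value means the predictions are closer to the ground truth.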
6. Learning rate

The learning rate is a hyperparameter that determines the step size at which
the model's parameters (weights and biases) are updated during the
optimization process, such as gradient descent. It's a crucial parameter as it
controls the rate of convergence during training and affects how quickly the
model learns from the data.

Notation: The learning rate is often denoted as α or learning_rate.
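
The effect of the step size can be seen on a toy problem. The sketch below is only an illustration: it minimizes the assumed loss L(w) = w², whose gradient is 2w, using the update w ← w − α · gradient. The function name, starting point, and the three learning rates are hypothetical choices.

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    # Minimize the toy loss L(w) = w**2, whose gradient is 2*w.
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w = w - learning_rate * grad  # update rule: w <- w - alpha * grad
    return w

small = gradient_descent(0.01)  # too small: still far from the minimum at 0
good = gradient_descent(0.1)    # reasonable: close to 0 after 20 steps
large = gradient_descent(1.1)   # too large: each step overshoots and |w| grows

print(small, good, large)
```

With the small rate the parameter barely moves, while with the large rate it moves away from the minimum, which matches the slow-convergence and divergence behaviours usually attributed to poorly chosen learning rates.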

7. Sample and batch

In the training process of deep neural networks, data is organized into
samples and processed in batches. These concepts are crucial for efficient
and effective training.

Notation:

● A single data point (input-output pair) is often denoted as (x, y) or
simply referred to as a sample.
● A collection of multiple samples is known as a batch.

Samples:

● A sample represents a single input data point (x) and its corresponding
target/output data (y).
● For example, in image classification, a sample could be an image and
its corresponding label.
● In mathematical notation, a single sample can be represented as (x_i,
y_i) where i is the index of the sample.

Batches:

● A batch is a collection of multiple samples grouped together.

● Batching allows training to be performed on subsets of the entire
dataset.
● A batch typically contains n samples, where n is the batch size.
● Batch processing improves memory efficiency and allows for
parallelization on hardware.
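
A minimal sketch of how samples are grouped into batches is shown below. The helper name `make_batches` and the toy dataset are assumptions for illustration; real training code usually also shuffles the samples before each pass.

```python
def make_batches(samples, batch_size):
    # Split the list of (x, y) samples into consecutive batches.
    # The last batch may be smaller if the dataset size is not a multiple
    # of the batch size.
    return [samples[i:i + batch_size]
            for i in range(0, len(samples), batch_size)]

# Hypothetical dataset of 10 samples (x_i, y_i).
samples = [([float(i)], i % 2) for i in range(10)]

batches = make_batches(samples, batch_size=4)
print([len(b) for b in batches])
```

Processing every batch once means the whole dataset has been seen once; with 10 samples and a batch size of 4, that takes three parameter updates.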
8. Hyperparameters

Hyperparameters are settings and values that are set before training a deep
neural network. These parameters are not learned during the training process;
instead, they are chosen by the developer or researcher and significantly
influence the behavior, capacity, and generalization of the neural network.

Notation:

● There isn't a specific universal notation for hyperparameters; they are
often named individually based on their roles (e.g., batch_size,
learning_rate, hidden_units).

Impact on the Network: Hyperparameters control various aspects of
the network's behavior, including its capacity, convergence rate, and
ability to avoid overfitting. Some key hyperparameters include:

Learning Rate (α): The step size at which the model's parameters are
updated during optimization. Too high a learning rate can lead to
divergence, while too low a learning rate can lead to slow convergence.
Batch Size (n): The number of samples in each training batch. A larger
batch size can lead to smoother parameter updates and more efficient
computation but might require more memory.
Number of Hidden Units/Layers: The depth and width of the network.
More hidden units and layers can increase the network's capacity to
learn complex patterns, but may also lead to overfitting.
Activation Functions: The non-linear functions applied to neuron
outputs. Different activation functions affect how information flows
through the network.
Regularization Strength: Hyperparameters like L1 and L2 regularization
control the prevention of overfitting by adding penalties to the loss
function based on the magnitudes of parameters.
Dropout Rate: The probability of dropping neurons during training to
reduce overfitting. It introduces randomness by randomly setting some
neurons to zero during each iteration.
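
To make the dropout rate concrete, here is a rough sketch of dropout applied to a list of activations. It assumes the common "inverted dropout" variant, in which the surviving activations are scaled by 1 / (1 − rate) so their expected value is unchanged; that scaling is an assumption of this example and is not described in these notes. The function name and the seeded random generator are also illustrative.

```python
import random

def dropout(activations, rate, rng):
    # rate is the probability of dropping (zeroing) each neuron.
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # Inverted dropout (assumed variant): scale kept activations by 1/keep.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the example is repeatable
out = dropout([1.0] * 1000, rate=0.5, rng=rng)

dropped = sum(1 for v in out if v == 0.0)
print(dropped, "of", len(out), "activations were set to zero")
```

Dropout of this kind is applied only during training; at prediction time all neurons are kept.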

➢ Explain backpropagation with diagram

Backpropagation is a widely used algorithm for training feedforward neural
networks. It computes the gradient of the loss function with respect to the
network weights. It is much more efficient than naively computing the
gradient with respect to each weight separately. This efficiency makes it
possible to use gradient methods to train multi-layer networks and update
weights to minimize loss; gradient descent and variants such as stochastic
gradient descent are often used.
The backpropagation algorithm works by computing the gradient of the loss
function with respect to each weight via the chain rule, computing the gradient
layer by layer, and iterating backward from the last layer to avoid redundant
computation of intermediate terms in the chain rule.

Working of Backpropagation:

Neural networks use supervised learning to generate output vectors from
the input vectors that the network operates on. The network compares the
generated output to the desired output and computes an error if the two do
not match. It then adjusts the weights according to this error to get closer
to the desired output.

Backpropagation Algorithm:

Step 1: Inputs X, arrive through the preconnected path.

Step 2: The input is modeled using real-valued weights W. The weights are
usually chosen randomly at the start.
Step 3: Calculate the output of each neuron from the input layer to the hidden
layer to the output layer.
Step 4: Calculate the error in the outputs:
Backpropagation Error = Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the
weights to reduce the error.
Step 6: Repeat the process until the desired output is achieved.
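
The steps above can be sketched on the smallest possible network: one input, one hidden sigmoid neuron, and one output sigmoid neuron. This is only an illustration under stated assumptions, not the procedure from any particular textbook: the loss is assumed to be 0.5 · (output − target)², and the initial weights, learning rate, and training sample are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, target, params, lr):
    w1, b1, w2, b2 = params
    # Step 3: forward pass from input to hidden to output.
    h = sigmoid(w1 * x + b1)
    y = sigmoid(w2 * h + b2)
    # Step 4: error = actual output - desired output.
    error = y - target
    # Step 5: backward pass using the chain rule, output layer first.
    delta_out = error * y * (1.0 - y)              # dLoss/d(output pre-activation)
    delta_hidden = delta_out * w2 * h * (1.0 - h)  # dLoss/d(hidden pre-activation)
    grads = (delta_hidden * x, delta_hidden, delta_out * h, delta_out)
    # Gradient descent update of all four parameters.
    new_params = tuple(p - lr * g for p, g in zip(params, grads))
    return new_params, 0.5 * error ** 2

params = (0.5, 0.0, -0.5, 0.0)  # Step 2: hypothetical initial weights and biases
losses = []
for _ in range(200):            # Step 6: repeat
    params, loss = train_step(1.0, 1.0, params, lr=0.5)
    losses.append(loss)

print(losses[0], losses[-1])
```

Note that `delta_hidden` reuses `delta_out`, which is the point of iterating backward: intermediate terms of the chain rule are computed once and shared.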

Types of Backpropagation
There are two types of backpropagation networks.
● Static backpropagation: Static backpropagation is a network designed
to map static inputs for static outputs. These types of networks are
capable of solving static classification problems such as OCR (Optical
Character Recognition).
● Recurrent backpropagation: Recurrent backpropagation is another
network used for fixed-point learning. Activations in recurrent
backpropagation are fed forward until a fixed value is reached. Static
backpropagation provides an instant mapping, while recurrent
backpropagation does not provide an instant mapping.

➢ Explain Activation Functions with diagram and the properties they must
hold in a neural network model.

(Diagram)
Activation functions determine whether a neuron should be activated (fire) or
not based on the weighted sum of its inputs. They introduce non-linearity to
the network, enabling it to model and learn complex patterns in the data.

Properties of Activation Functions:

Activation functions should ideally possess certain properties to ensure the
effectiveness of neural network training and performance. The primary
properties include:

1. Non-linearity: Activation functions must be non-linear to enable neural
networks to learn complex relationships. A composition of linear
functions remains linear, which limits the network's expressive power.
With non-linear activation functions and enough hidden units, a network
can approximate a very wide class of functions.
2. Continuity: Activation functions should be continuous, as this helps with
gradient-based optimization methods like gradient descent.
Discontinuous functions may lead to challenges in calculating gradients
and optimizing the network.
3. Monotonicity: Monotonic activation functions (functions that either
increase or decrease throughout their range) help in propagating
gradients consistently during backpropagation. This facilitates learning
and optimization.
4. Boundedness: Activation functions that have bounded output ranges
can prevent activation values from becoming too large and leading to
numerical instability. It can also help with stable learning and
convergence.
5. Differentiability: Activation functions must be differentiable almost
everywhere to enable gradient-based optimization algorithms like
backpropagation. This allows efficient calculation of gradients for
parameter updates.

Common Activation Functions:

➢ Sigmoid Function:
○ Formula: σ(x) = 1 / (1 + exp(-x))
○ Range: (0, 1)
○ S-shaped curve, used in the past for binary classification. Can
suffer from vanishing gradient problem for very high or low
inputs.
➢ Hyperbolic Tangent (Tanh) Function:
○ Formula: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
○ Range: (-1, 1)
○ Similar to the sigmoid but zero-centered, with larger gradients near
zero, which mitigates the vanishing gradient problem to some extent.
It still saturates for very high or low inputs.
➢ Rectified Linear Unit (ReLU):
○ Formula: ReLU(x) = max(0, x)
○ Range: [0, ∞)
○ Widely used due to its simplicity and effectiveness. Mitigates the
vanishing gradient problem for positive inputs and helps with training
deeper networks.
➢ Leaky ReLU:
○ Formula: LeakyReLU(x) = x if x > 0, else LeakyReLU(x) = α * x
where α is a small positive constant.
○ Range: (-∞, ∞)
○ Addresses the "dying ReLU" problem by allowing small
gradients for negative inputs.
➢ Parametric ReLU (PReLU):
○ Similar to Leaky ReLU but α is learned during training instead of
being a fixed constant.
➢ Exponential Linear Unit (ELU):
○ Formula: ELU(x) = x if x > 0, else ELU(x) = α * (exp(x) - 1) where
α is a positive constant.
○ Smooth transition around zero, which helps mitigate vanishing
gradient and allows negative values without loss of information.
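
The formulas listed above translate directly into code. The sketch below uses only the Python standard library; the default values of α (0.01 for Leaky ReLU, 1.0 for ELU) are assumptions chosen for illustration, since the notes only say that α is a small positive constant.

```python
import math

def sigmoid(x):
    # 1 / (1 + exp(-x)); output in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # (exp(x) - exp(-x)) / (exp(x) + exp(-x)); output in (-1, 1).
    return math.tanh(x)

def relu(x):
    # max(0, x); output in [0, inf).
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # x for positive inputs, alpha * x otherwise.
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (exp(x) - 1) otherwise.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value),
          leaky_relu(value), elu(value))
```

PReLU has the same form as `leaky_relu`, except that `alpha` would be a learned parameter rather than a fixed argument.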

8. Compare activation functions:

○ ReLU
○ Leaky ReLU (LReLU)
○ ELU
ReLU (Rectified Linear Unit):

● Formula: ReLU(x) = max(0, x)

● Range: [0, ∞)
● Pros:
● Simple and computationally efficient.
● Addresses the vanishing gradient problem by allowing non-zero
gradients for positive inputs.
● Promotes sparsity, where only some neurons activate.
● Cons:
● Can suffer from the "dying ReLU" problem, where neurons get
stuck in a non-activating state during training (gradients become
zero for negative inputs).

Leaky ReLU:
● Formula: LeakyReLU(x) = x if x > 0, else LeakyReLU(x) = α * x where α
is a small positive constant.
● Range: (-∞, ∞)
● Pros:
● Addresses the "dying ReLU" problem by allowing small
gradients for negative inputs.
● Keeps the simplicity and low computational cost of ReLU, although
negative inputs give small non-zero outputs rather than exact zeros.
● Cons:
● Introduces a new hyperparameter α which needs to be tuned.
● May not perform as well as other activations for some tasks.

ELU (Exponential Linear Unit):

● Formula: ELU(x) = x if x > 0, else ELU(x) = α * (exp(x) - 1) where α is a
positive constant.
● Range: (-α, ∞)
● Pros:
● Smooth transition around zero, which helps mitigate vanishing
gradient and allows negative values without loss of information.
● Eliminates the "dying ReLU" problem as it has non-zero
gradients for all inputs.
● Better generalization performance on some tasks compared to
ReLU and Leaky ReLU.
● Cons:
● Slightly computationally more expensive due to the exponential
operation.

Comparison:

Advantages:
● ReLU: Simple, computationally efficient, addresses vanishing
gradient for positive inputs.
● Leaky ReLU: Addresses dying ReLU, remains simple and cheap to compute.
● ELU: Addresses dying ReLU, provides smooth transition, better
generalization.
Disadvantages:
● ReLU: Can suffer from dying ReLU, since the output and gradient are
zero for all negative inputs.
● Leaky ReLU: Introduces a hyperparameter, may not perform as
well for all tasks.
● ELU: Slightly more computationally expensive.
Dying Neurons:
● ReLU: Prone to dying neurons, because the gradient is zero for
negative inputs.
● Leaky ReLU: Less prone to dying neurons due to non-zero
gradient for negative inputs.
● ELU: Non-zero gradient for all inputs, effectively eliminates
dying neurons.
Smoothness:
● ReLU: Continuous, but not differentiable at zero.
● Leaky ReLU: Continuous, but not differentiable at zero.
● ELU: Continuous everywhere, and differentiable at zero when α = 1.
Sparsity:
● ReLU: Promotes sparsity due to zero activation for negative
inputs.
● Leaky ReLU: Less sparsity-promoting than ReLU, since negative inputs
give small non-zero values.
● ELU: Less sparsity-promoting due to non-zero values for
negative inputs.
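
The "dying neuron" comparison comes down to the gradient each function passes back for negative inputs. The sketch below computes those gradients; the α defaults are assumptions for illustration, and the value returned at exactly x = 0 is a convention chosen here, since ReLU and Leaky ReLU are not differentiable at that point.

```python
import math

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise.
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is 1 for positive inputs and alpha otherwise.
    return 1.0 if x > 0 else alpha

def elu_grad(x, alpha=1.0):
    # Gradient is 1 for positive inputs and alpha * exp(x) otherwise.
    return 1.0 if x > 0 else alpha * math.exp(x)

for value in (-3.0, -0.5, 2.0):
    print(value, relu_grad(value), leaky_relu_grad(value), elu_grad(value))
```

For negative inputs ReLU passes back exactly zero, Leaky ReLU passes back the constant α, and ELU passes back a positive value that shrinks as the input becomes more negative.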

9. What is Hyperparameter? Describe categories of hyperparameter.

A hyperparameter is a parameter that is not learned by the neural network
during training but is set before the training process begins. These
parameters influence the behavior and performance of the neural network and
play a critical role in determining how the network learns from the data.

Hyperparameters are often set by the developer or researcher based on
domain knowledge, trial-and-error experimentation, or automated techniques
like hyperparameter tuning. Choosing appropriate hyperparameter values can
significantly impact the convergence speed, model performance, and
generalization of the neural network.

Categories of Hyperparameters:

Hyperparameters can be broadly categorized into different groups based on
their roles and the aspects of the neural network they affect:

Model Architecture Hyperparameters:

● These hyperparameters define the fundamental structure of the
neural network.
● Examples:
● Number of layers: The depth of the network, including
input, hidden, and output layers.
● Number of neurons in each layer: Determines the
capacity and complexity of the network.
● Type of layers: Choices like fully connected,
convolutional, recurrent, etc.
Training Hyperparameters:
● These hyperparameters affect the training process and
optimization of the neural network.
● Examples:
● Learning rate (α): Controls the step size in gradient
descent optimization.
● Batch size: The number of samples used in each iteration
during training.
● Number of epochs: The number of times the entire
training dataset is processed during training.
● Optimization algorithm: Specifies how parameter updates
are performed (e.g., SGD, Adam).
Regularization Hyperparameters:
● Regularization techniques prevent overfitting by adding
penalties to the loss function.
● Examples:
● L1 and L2 regularization strengths: Control the amount of
penalty applied to the weights.
● Dropout rate: Probability of dropping neurons during
training to prevent overfitting.
Initialization Hyperparameters:
● Initialization techniques set the initial values of the weights and
biases in the neural network.
● Examples:
● Weight initialization schemes: Choices like random,
Xavier/Glorot, He initialization.
● Bias initialization: Setting biases to zero or small
positive/negative values.
Activation Function Hyperparameters:
● Activation functions introduce non-linearity to the network and
influence the flow of information.
● Examples:
● Type of activation function: Choices like ReLU, sigmoid,
tanh, etc.
● Parameters of the activation functions: For instance, the
slope of Leaky ReLU.
Learning Rate Schedule Hyperparameters:
● These hyperparameters control how the learning rate changes
during training.
● Examples:
● Learning rate decay: Gradually decreasing the learning
rate over epochs.
● Learning rate annealing: Adjusting the learning rate
based on a pre-defined schedule.
Batch Normalization and Regularization Hyperparameters:
● These hyperparameters are specific to techniques like batch
normalization and dropout.
● Examples:
● Batch normalization hyperparameters: Adjusting
parameters like momentum and epsilon.
● Drop rate for dropout: Controlling the probability of
dropping neurons.
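
The categories above can be collected in one configuration object. The following is a minimal sketch: the class name, field names, and default values are illustrative assumptions rather than recommended settings, and the exponential decay rule is only one possible learning rate schedule.

```python
from dataclasses import dataclass


@dataclass
class Hyperparameters:
    # Model architecture
    hidden_layers: int = 2
    units_per_layer: int = 64
    # Training
    learning_rate: float = 0.01
    batch_size: int = 32
    epochs: int = 20
    # Regularization
    l2_strength: float = 1e-4
    dropout_rate: float = 0.5
    # Activation function
    activation: str = "relu"
    # Learning rate schedule (exponential decay applied once per epoch)
    decay_rate: float = 0.9

    def learning_rate_at(self, epoch: int) -> float:
        """Learning rate after `epoch` epochs of exponential decay."""
        return self.learning_rate * self.decay_rate ** epoch


hp = Hyperparameters(learning_rate=0.1, decay_rate=0.5)
for epoch in range(4):
    print(epoch, hp.learning_rate_at(epoch))
```

Keeping all hyperparameters in one place makes it easier to record which settings produced which result when tuning.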

10. What is the vanishing gradient problem? How can it be identified, and what
are the solutions to avoid it?
The vanishing gradient problem is a challenge that occurs during the training
of deep neural networks, particularly networks with many layers. It refers to
the situation where the gradients of the loss function with respect to the
network's parameters become very small as they are propagated backward
through the layers during the training process. This leads to slow
convergence and makes it difficult to update the parameters of the earlier
layers, which impedes the learning process.
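
The shrinkage can be made concrete with a simplified case. As an illustrative assumption (not taken from the notes), consider a chain of n layers with one sigmoid unit each, where a_k = σ(z_k), z_k = w_k a_{k-1} + b_k, and a_0 is the input. The chain rule then gives:

```latex
\frac{\partial L}{\partial w_1}
  = \frac{\partial L}{\partial a_n}
    \left( \prod_{k=2}^{n} \sigma'(z_k)\, w_k \right)
    \sigma'(z_1)\, a_0,
\qquad
\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}.
```

If every weight satisfies |w_k| ≤ 1, the gradient with respect to the first weight is bounded by (1/4)^n |∂L/∂a_n| |a_0|. For n = 10 the factor (1/4)^10 is already about 9.5 × 10^-7, which is why the earliest layers receive almost no update.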

Identifying the Vanishing Gradient Problem:

You can identify the vanishing gradient problem through the following
observations during training:

Slow Convergence: The loss function decreases very slowly during
training, making the learning process time-consuming.
Small Gradient Magnitudes: The gradients of the early layers (closer to
the input) become extremely small, indicating that the network
struggles to learn from the data effectively.
Stagnant or Oscillating Training: The training loss may plateau or
fluctuate without improving, and the network then also performs poorly
on the validation or test data.
Ineffective Learning in Early Layers: The weights of early layers hardly
change, implying that they are not being updated effectively.
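
One way to check for small gradient magnitudes is to compare the gradient size layer by layer. The sketch below does this for a toy chain of scalar sigmoid layers; the function name `layer_gradients`, the weights of 1.0, and the depth of 8 are assumptions chosen for illustration, and a real network would read these values from its framework after a backward pass.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_gradients(x, weights):
    """Return |d output / d w_k| for each layer of a scalar sigmoid chain.

    The chain is activations[k + 1] = sigmoid(weights[k] * activations[k])
    with activations[0] = x (no biases).
    """
    # Forward pass: store every activation.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass: walk from the output towards the input.
    grads = [0.0] * len(weights)
    upstream = 1.0  # d output / d (last activation)
    for k in reversed(range(len(weights))):
        a_out = activations[k + 1]
        local = a_out * (1.0 - a_out)   # sigmoid derivative at this layer
        grads[k] = abs(upstream * local * activations[k])
        upstream *= local * weights[k]  # now d output / d activations[k]
    return grads


grads = layer_gradients(x=1.0, weights=[1.0] * 8)
for k, g in enumerate(grads, start=1):
    print(f"layer {k}: |gradient| = {g:.2e}")
```

In this toy chain the printed magnitudes grow steadily from the first layer to the last, which is the pattern to look for when diagnosing vanishing gradients.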

Solutions to Avoid the Vanishing Gradient Problem:

Several techniques have been developed to mitigate the vanishing gradient
problem and enable the successful training of deep neural networks:

Activation Functions:
● Avoid saturating activation functions such as sigmoid or tanh in
hidden layers: their derivatives are at most 0.25 and 1
respectively and approach zero for large inputs. Prefer ReLU,
Leaky ReLU, or ELU, whose derivative is 1 for positive inputs, so
gradients pass through unchanged there.
Batch Normalization:
● Batch normalization normalizes the activations in each layer,
helping in maintaining reasonable gradient magnitudes
throughout the network. It can significantly mitigate the
vanishing gradient problem.
Skip Connections and Residual Networks (ResNets):
● Skip connections, introduced in ResNets, allow information to
bypass one or more layers. This helps gradients flow more
directly and can alleviate the vanishing gradient problem.
Gradient Clipping:
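● Gradient clipping caps the norm or value of the gradients during
training to prevent them from becoming too large. It therefore
addresses the related exploding gradient problem rather than
vanishing gradients, and it is often used alongside the other
techniques listed here, especially in recurrent networks.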
Weight Initialization:
● Use appropriate weight initialization techniques, such as He
initialization, which can help in balancing the scale of activations
and gradients.
Use Shallow Networks:
● If you suspect the vanishing gradient problem is severe,
consider using shallower networks or architectures with fewer
layers to reduce the depth-related issues.
Learning Rate Scheduling:
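● A learning rate schedule changes the step size, not the gradient
itself, so it does not remove vanishing gradients. It can still
stabilize training, and adaptive optimizers such as Adam, which
normalize each parameter's update by its gradient history, can
partly compensate for small gradients in early layers.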
Rescale Inputs:
● Rescale input features to approximately zero mean and unit
variance. This keeps the pre-activations of the first layers
away from the saturated regions of sigmoid or tanh, where the
gradient is close to zero.
Use LSTM and GRU Architectures:
● If you're dealing with sequential data, Long Short-Term Memory
(LSTM) and Gated Recurrent Unit (GRU) architectures are
designed to mitigate vanishing gradient problems in recurrent
networks.
Attention Mechanisms:
● Attention mechanisms, especially in natural language
processing tasks, let the network connect distant positions of
the input sequence directly, so gradients do not have to pass
through many recurrent steps and are less affected by vanishing
gradients.
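
The effect of skip connections from the list above can be sketched numerically. The example compares the derivative of the output with respect to the input for a plain chain and a residual chain of scalar sigmoid blocks; the function name `input_gradient`, the fixed weight of 1, and the depth of 20 are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(x, depth, residual):
    """Return d h_depth / d h_0 for a scalar chain with weight 1 in every block.

    Plain block:    h_next = sigmoid(h)
    Residual block: h_next = h + sigmoid(h)
    """
    h, grad = x, 1.0
    for _ in range(depth):
        s = sigmoid(h)
        local = s * (1.0 - s)        # derivative of sigmoid at h
        if residual:
            grad *= 1.0 + local      # the identity path contributes the 1
            h = h + s
        else:
            grad *= local
            h = s
    return grad


plain = input_gradient(0.5, depth=20, residual=False)
skip = input_gradient(0.5, depth=20, residual=True)
print(f"plain: {plain:.2e}, residual: {skip:.2e}")
```

Because each residual block multiplies the gradient by 1 plus a non-negative term, the product cannot fall below 1 in this toy setting, whereas the plain chain multiplies by at most 0.25 per block.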
11. What are optimization algorithms? Explain gradient descent. (Notes)
12. Explain the following optimization algorithms: (Any 2)
○ Gradient Descent
○ Stochastic Gradient Descent
○ Mini-Batch Gradient Descent

1. Gradient Descent:

Gradient Descent is an optimization algorithm used to minimize a loss
function by iteratively adjusting the model's parameters in the direction that
reduces the loss. It computes the gradient of the loss function with respect to
the parameters and updates the parameters by taking steps proportional to
the negative gradient.

Algorithm Steps:

1. Initialize the model's parameters.
2. Calculate the loss by comparing predicted values to actual targets
over the entire training dataset.
3. Compute the gradient of the loss with respect to each parameter using
backpropagation.
4. Update each parameter by subtracting the learning rate (α) times its
gradient: θ ← θ − α ∇L(θ).
5. Repeat steps 2 to 4 until the loss converges.
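
The steps above can be sketched for a one-variable linear model. This is a minimal illustration, assuming a mean squared error loss and toy data generated from y = 2x + 1; the function name, learning rate, and step count are arbitrary choices, and the gradient is written out by hand instead of using backpropagation.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0                      # initialize the parameters
    n = len(xs)
    for _ in range(steps):
        # predictions and errors on the whole dataset
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # gradient of the mean squared error with respect to w and b
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # move against the gradient, scaled by the learning rate
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]         # toy data from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(w, b)
```

Every update here touches all five samples, which is what makes full-batch gradient descent expensive on large datasets.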

Advantages and Disadvantages:

● Advantages: Simple, widely used, can converge to a global minimum if
the loss function is convex.
● Disadvantages: Computationally expensive for large datasets, can
converge slowly in areas with shallow gradients or saddle points.

2. Stochastic Gradient Descent (SGD):

Stochastic Gradient Descent is a variant of gradient descent that updates
parameters using the gradient of the loss function for a single randomly
chosen training sample at each iteration. This introduces randomness and
noise, but each update is cheap, and the noise can help the optimizer escape
shallow local minima.

Algorithm Steps:

1. Initialize the model's parameters.
2. Randomly select a training sample.
3. Calculate the loss and gradient for the selected sample.
4. Update parameters using the gradient and learning rate.
5. Repeat steps 2 to 4 for multiple iterations.
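
A minimal sketch of these steps, again for a linear model with a squared error loss on noise-free toy data from y = 2x + 1. The function name, learning rate, iteration count, and fixed random seed are assumptions chosen so the example is reproducible.

```python
import random


def sgd(xs, ys, learning_rate=0.02, iterations=20000, seed=0):
    """Fit y = w * x + b by updating on one random sample per iteration."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        i = rng.randrange(len(xs))       # pick one training sample at random
        error = (w * xs[i] + b) - ys[i]
        # gradient of the squared error for this single sample
        grad_w = 2.0 * error * xs[i]
        grad_b = 2.0 * error
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]         # toy data from y = 2x + 1
w, b = sgd(xs, ys)
print(w, b)
```

Each update uses one sample, so it is much cheaper than a full pass over the data, at the cost of a noisier path towards the minimum.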
Advantages and Disadvantages:

● Advantages: Faster updates, can escape local minima, handles large
datasets more efficiently.
● Disadvantages: Noisy updates due to randomness, can lead to
oscillations, might not converge well if learning rate is not well-tuned.

3. Mini-Batch Gradient Descent:

Mini-Batch Gradient Descent combines the benefits of both Gradient Descent
and Stochastic Gradient Descent by updating parameters using the average
gradient of a small batch of randomly chosen training samples. This approach
balances noise and computational efficiency.

Algorithm Steps:

1. Initialize the model's parameters.
2. Split the training dataset into mini-batches.
3. For each mini-batch, calculate the loss and average gradient.
4. Update parameters using the average gradient and learning rate.
5. Repeat steps 3 and 4 for multiple iterations over all mini-batches.
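
The mini-batch variant can be sketched in the same toy setting (linear model, squared error, data from y = 2x + 1). The function name, batch size, learning rate, and epoch count are illustrative assumptions; reshuffling the data before every epoch is a common choice that the steps above do not require.

```python
import random


def minibatch_gd(xs, ys, batch_size=2, learning_rate=0.02, epochs=2000, seed=0):
    """Fit y = w * x + b using the average gradient of each mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)             # new random split into batches
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # average gradient of the squared error over the mini-batch
            grad_w = 2.0 / len(batch) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = 2.0 / len(batch) * sum(errors)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]         # toy data from y = 2x + 1
w, b = minibatch_gd(xs, ys)
print(w, b)
```

Setting `batch_size` to 1 recovers stochastic gradient descent, and setting it to the dataset size recovers full-batch gradient descent.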

Advantages and Disadvantages:

● Advantages: Efficient and less noisy updates compared to SGD,
handles large datasets well.
● Disadvantages: May require tuning the batch size, still not guaranteed
to converge to a global minimum.

13. What is regularization? List the popular regularization techniques and
explain any two.
Regularization is a set of techniques used to prevent overfitting in machine
learning models, particularly in deep neural networks. Overfitting occurs when
a model learns to perform well on the training data but fails to generalize to
new, unseen data. Regularization methods aim to introduce additional
constraints or penalties to the training process, discouraging the model from
fitting noise or irrelevant patterns in the data.

Popular Regularization Techniques:

1. L1 Regularization (Lasso):
● Adds a penalty term to the loss function proportional to the
absolute values of the model's parameters.
● Encourages sparsity in the parameter values, leading to some
parameters being exactly zero.
● Useful for feature selection, as it tends to set less important
features' weights to zero.
● Loss Function: L(θ) + λ * Σ|θ|, where L(θ) is the original loss, λ is
the regularization strength, and θ represents model parameters.
2. L2 Regularization (Ridge):
● Adds a penalty term to the loss function proportional to the
squared values of the model's parameters.
● Encourages small parameter values, but does not force them to
exactly zero.
● Helps in preventing overfitting and stabilizing the training
process.
● Loss Function: L(θ) + λ * Σθ^2, where L(θ) is the original loss, λ
is the regularization strength, and θ represents model
parameters.
3. Dropout:
● A regularization technique specific to neural networks.
● During training, randomly "drops out" (sets to zero) a proportion
of neurons in a layer with a specified probability.
● Forces the network to learn robust features that do not depend
heavily on any single neuron.
● Implicitly trains an ensemble of sub-networks that share weights,
improving generalization.
● Not applied during inference: all neurons are kept, and the
activations are scaled by the keep probability (1 − dropout rate)
unless that scaling was already applied during training
(inverted dropout).
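
The two penalty formulas in the list, L(θ) + λ Σ|θ| and L(θ) + λ Σθ², can be written directly as code. This is a small sketch; the function names and the example values for θ, λ, and the base loss are assumptions for illustration.

```python
def l1_penalty(params, lam):
    """lambda times the sum of absolute parameter values."""
    return lam * sum(abs(p) for p in params)


def l2_penalty(params, lam):
    """lambda times the sum of squared parameter values."""
    return lam * sum(p * p for p in params)


def regularized_loss(base_loss, params, lam, kind="l2"):
    """Original loss L(theta) plus the chosen penalty term."""
    penalty = l1_penalty(params, lam) if kind == "l1" else l2_penalty(params, lam)
    return base_loss + penalty


theta = [0.5, -2.0, 0.0, 3.0]
l1_total = regularized_loss(1.0, theta, lam=0.1, kind="l1")
l2_total = regularized_loss(1.0, theta, lam=0.1, kind="l2")
print(l1_total, l2_total)
```

With these values the L2 total is larger than the L1 total because the squared term weights the two large parameters (-2.0 and 3.0) much more heavily.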

Two Regularization Techniques Explained in Detail:

1. L1 Regularization (Lasso):

L1 regularization encourages sparsity in the model's parameters. This means
it forces some of the parameter values to be exactly zero, effectively
performing feature selection during training. The loss function is modified by
adding a term proportional to the sum of the absolute values of the
parameters.

Benefits:

● Feature selection: L1 regularization can automatically select relevant
features by setting irrelevant feature weights to zero.
● Simplicity: It can lead to simpler and more interpretable models by
reducing the number of non-zero parameters.
Challenges:

● Hyperparameter tuning: The value of the regularization strength λ
needs to be carefully chosen to balance the trade-off between fitting
the data and penalizing the parameters.

2. L2 Regularization (Ridge):

L2 regularization encourages small parameter values without forcing them to
be exactly zero. It adds a term proportional to the sum of the squared values
of the parameters to the loss function.

Benefits:

● Smoothness: L2 regularization can lead to smoother weight values,
which can help prevent large variations between parameter values.
● Improved generalization: By discouraging large parameter values, L2
regularization contributes to improved generalization on unseen data.

Challenges:

● Less feature selection: L2 regularization does not perform feature
selection as aggressively as L1 regularization. It tends to keep all
features in the model but with smaller magnitudes.
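
The difference between driving weights to exactly zero and merely shrinking them can be sketched by applying only the penalty, with no data loss. For L2 the sketch uses a plain gradient step on λθ². For L1 it uses a soft-thresholding (proximal) step, which is a technique swapped in here for illustration because a plain gradient step on |θ| tends to oscillate around zero. The function names, the weights, and the values of the learning rate and λ are assumptions.

```python
def l1_step(theta, lr, lam):
    """Soft-thresholding (proximal) step for the penalty lam * |theta|."""
    shrink = lr * lam
    if theta > shrink:
        return theta - shrink
    if theta < -shrink:
        return theta + shrink
    return 0.0                           # small weights become exactly zero


def l2_step(theta, lr, lam):
    """Gradient step for the penalty lam * theta**2 (gradient 2 * lam * theta)."""
    return theta * (1.0 - 2.0 * lr * lam)


weights = [3.0, -0.5, 0.05, -0.01]
l1_weights, l2_weights = list(weights), list(weights)
for _ in range(10):
    l1_weights = [l1_step(t, lr=0.1, lam=0.1) for t in l1_weights]
    l2_weights = [l2_step(t, lr=0.1, lam=0.1) for t in l2_weights]
print("L1:", l1_weights)
print("L2:", l2_weights)
```

After ten steps the two smallest weights are exactly zero under the L1 step, while the L2 step has only scaled every weight by the same factor and left all of them non-zero.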

14. Compare:
a. Dropout and DropConnect
b. L1 and L2 regularization

Dropout vs. DropConnect:

1. Dropout:

● Dropout is a regularization technique introduced by Geoffrey Hinton
and his collaborators. It's primarily used in neural networks, especially
deep neural networks.
● During each training iteration, a fraction (dropout rate) of neurons in a
layer is randomly selected and "dropped out." This means their outputs
are set to zero for that iteration.
● In the common "inverted dropout" variant, the remaining active neurons'
outputs are scaled by 1 / (1 − dropout rate) during training to
compensate for the dropped-out neurons.
● This process creates a form of ensemble learning, as the network
trains on various sub-networks with different neuron configurations.
● During inference (testing or predictions), dropout is turned off and all
neurons are used. With inverted dropout no further scaling is needed;
without it, outputs are multiplied by the keep probability
(1 − dropout rate) to keep magnitudes consistent.
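
A minimal sketch of dropout on a list of activations, assuming the inverted variant in which kept outputs are divided by the keep probability (1 − dropout rate) during training and nothing is scaled at inference. The function name and signature are illustrative and do not correspond to any particular library.

```python
import random


def dropout(activations, rate, training, rng=random):
    """Inverted dropout on a list of activations.

    `rate` is the probability of dropping a unit; kept units are scaled by
    1 / (1 - rate) during training so the expected output is unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)         # inference: identity, no scaling
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
layer = [1.0] * 10000
train_out = dropout(layer, rate=0.5, training=True, rng=rng)
test_out = dropout(layer, rate=0.5, training=False)
print(sum(train_out) / len(train_out))   # close to 1.0 on average
```

Because the scaling happens during training, the same layer can be used unchanged at inference time.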

2. DropConnect:

● DropConnect is a generalization of dropout that extends the idea to
connections (weights) between neurons, rather than applying dropout
to neurons themselves.
● Similar to dropout, during each training iteration, a fraction of the
weights in the network are randomly set to zero.
● The process of randomly setting weights to zero and then scaling the
retained weights encourages the network to learn from different
sub-networks in each iteration.
● DropConnect introduces randomness at the connection level, affecting
the network's learning process.
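
For comparison, a training-time forward pass with DropConnect masks individual weights instead of whole neuron outputs. The sketch below is hypothetical: the function name, the nested-list weight layout, and the rescaling of kept weights by 1 / (1 − rate) (mirroring inverted dropout) are assumptions made for illustration.

```python
import random


def dropconnect_forward(inputs, weights, rate, rng=random):
    """One training-time forward pass of a linear layer with DropConnect.

    weights[j][i] connects input i to output j. Each weight is dropped
    independently with probability `rate`; kept weights are scaled by
    1 / (1 - rate), which is an assumption of this sketch.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    outputs = []
    for row in weights:
        total = 0.0
        for w, x in zip(row, inputs):
            if rng.random() < keep:      # this connection survives
                total += (w / keep) * x
        outputs.append(total)
    return outputs


inputs = [1.0, 2.0]
weights = [[1.0, 1.0], [0.5, -1.0]]
no_drop = dropconnect_forward(inputs, weights, rate=0.0)
masked = dropconnect_forward(inputs, weights, rate=0.5, rng=random.Random(0))
print(no_drop, masked)
```

Each output neuron stays active here but sees a random subset of its incoming connections, whereas dropout would zero the entire output of a dropped neuron.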

Comparison:

● Both Dropout and DropConnect introduce randomness to the network,
effectively creating an ensemble of sub-networks, leading to better
generalization and reduced overfitting.
● Dropout focuses on neurons, while DropConnect focuses on the
individual weights or connections.
● Dropout tends to influence the network's capacity more significantly
since it affects the number of active neurons, whereas DropConnect
primarily affects the strengths of connections.
● DropConnect can be more computationally expensive due to the need
to handle individual weights.

L1 Regularization vs. L2 Regularization:

1. L1 Regularization (Lasso):

● L1 regularization, also known as Lasso (Least Absolute Shrinkage and
Selection Operator), involves adding a penalty term to the loss function
proportional to the sum of the absolute values of the model's
parameters.
● This encourages many parameter values to become exactly zero,
leading to sparse models where some features are effectively
excluded.
● L1 regularization is well-suited when you suspect that many features
are irrelevant or redundant.

2. L2 Regularization (Ridge):

● L2 regularization, also known as Ridge regularization, involves adding a
penalty term to the loss function proportional to the sum of the
squared values of the model's parameters.
● L2 regularization encourages small parameter values without forcing
them to be exactly zero. It smooths out parameter values.
● It's particularly effective in preventing the model from relying too
heavily on a small number of features.

Comparison:

● Both L1 and L2 regularization add penalties to the loss function based
on parameter magnitudes to prevent overfitting.
● L1 regularization can perform feature selection as it drives some
parameters to exactly zero, leading to models with fewer active
features.
● L2 regularization retains all features but discourages large parameter
values, which helps in generalization.
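● The L2 penalty grows quadratically, so it penalizes large weights much
more heavily than small ones, while the L1 penalty grows linearly and
applies the same shrinkage pressure to every non-zero weight.
(Sensitivity to outliers is a property of L1 and L2 loss functions, where
the squared L2 loss is the more sensitive one, not of these weight
penalties.)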
