Unit 1 Question and Answers

Deep learning is a form of machine learning that uses artificial neural networks inspired by the human brain. These neural networks consist of interconnected nodes organized into layers that can automatically learn representations of data. The major deep learning architectures are feedforward neural networks, convolutional neural networks, and recurrent neural networks. A feedforward neural network passes information from the input to output layers without loops. It contains an input layer, one or more hidden layers, and an output layer. A multilayer perceptron is a type of feedforward network with an input, hidden, and output layer where neurons in each layer are fully connected.

➢ What is Deep Learning? List the major architectures of deep networks


➢ Deep Learning is a subset of machine learning that focuses on training
artificial neural networks to perform tasks that typically require human
intelligence. It involves learning representations of data through
multiple layers of interconnected nodes (neurons) inspired by the
structure and function of the human brain. Deep Learning has gained
significant attention and success in various fields, including computer
vision, natural language processing, speech recognition, and more.
➢ Deep Learning models automatically learn features from data,
eliminating the need for manual feature engineering. The architectures
of deep networks are designed to handle complex patterns and
relationships within data, making them capable of achieving
state-of-the-art performance on tasks like image classification, object
detection, machine translation, and more.

Key Components of Deep Learning:

■ Neural Networks: At the core of deep learning are artificial neural
networks (ANNs). These networks consist of interconnected
nodes, or neurons, organized into layers. Neurons in one layer
are connected to neurons in the adjacent layers through
weighted connections. Each connection has an associated
weight that adjusts during training.
■ Layers:
➔ Input Layer: This is where data is fed into the network.
Each neuron in the input layer corresponds to a feature or
attribute of the input data.
➔ Hidden Layers: These are intermediate layers between
the input and output layers. Each hidden layer processes
and transforms the data through a combination of
weighted sums and activation functions.
➔ Output Layer: The final layer of the network produces the
model's predictions or classifications.
■ Activation Functions: Neurons apply activation functions to the
weighted sum of their inputs to introduce non-linearity to the
model. Common activation functions include ReLU (Rectified
Linear Unit), sigmoid, and tanh. Activation functions enable
neural networks to learn complex relationships within data.
■ Learning Algorithms: Deep learning models learn by adjusting
the weights of connections between neurons to minimize a
defined loss function. This process involves optimization
algorithms like gradient descent, which iteratively updates the
weights in the direction that reduces the loss.
■ Backpropagation: Backpropagation is a fundamental technique
used to train deep learning models. It involves computing
gradients of the loss function with respect to the model's
weights and using these gradients to update the weights in a
way that minimizes the loss. This process is responsible for
making the network learn and improve its performance over
time.

Major Architectures of Deep Networks:

➔ Feedforward Neural Networks (FNN): This is the simplest form
of a deep neural network. Its fully connected variant is known
as a multilayer perceptron (MLP). It consists of an input layer,
one or more hidden layers, and an output layer. In an MLP, each
neuron in one layer is connected to every neuron in the
subsequent layer. FNNs are used for tasks like regression and
classification.
➔ Convolutional Neural Networks (CNN): CNNs are primarily used
for image analysis and computer vision tasks. They utilize
convolutional layers to automatically learn spatial hierarchies of
features from images. These networks are well-suited for tasks
such as image classification, object detection, and image
segmentation.
➔ Recurrent Neural Networks (RNN): RNNs are designed to handle
sequential data, such as time series and natural language. They
have loops that allow information to persist across different
time steps, making them capable of capturing temporal
dependencies. Long Short-Term Memory (LSTM) and Gated
Recurrent Unit (GRU) are variations of RNNs that help alleviate
the vanishing gradient problem.

➢ Explain the structure of Feed Forward Neural Network with diagram.

A Feed Forward Neural Network consists of multiple layers of interconnected
neurons, each layer passing information from the input layer to the output
layer in a "feed forward" manner. The data flows in one direction, through the
hidden layers, and ultimately produces an output. The FNN can be divided into
three main types of layers: the input layer, hidden layers, and the output layer.
​ Input Layer:
● The input layer is the starting point of the network and
represents the raw features or attributes of the input data.
● Each neuron in the input layer corresponds to a feature of the
input data. For example, in image analysis, each pixel's intensity
could be a feature.
● The number of neurons in the input layer is determined by the
dimensionality of the input data.
​ Hidden Layers:
● Hidden layers are intermediate layers between the input and
output layers.
● Each hidden layer contains a certain number of neurons (also
called units), which learn to capture complex patterns and
relationships in the data.
● The number of hidden layers and the number of neurons in each
layer are design choices that can impact the network's capacity
to learn and its computational efficiency.
● Neurons in hidden layers apply weighted sums of their inputs,
followed by an activation function, to produce their outputs.
These outputs are then passed as inputs to the next layer.
​ Output Layer:
● The output layer produces the final results of the network's
computation.
● The number of neurons in the output layer is determined by the
nature of the task. For example, in binary classification, a
single neuron that outputs the probability of the positive class
is usually enough (two neurons, one per class, are also
possible). In multi-class classification, there would be as
many neurons as there are classes.
● The activation function used in the output layer depends on the
task. For binary classification, a sigmoid function is often used,
while for multi-class classification, a softmax function is
common.
(notes and book)
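
The layer-by-layer flow described in this answer can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the layer sizes, the weight and bias values, and the helper names `dense` and `forward` are all made up for the example and do not come from the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum plus bias, then activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x):
    # Hypothetical network: 3 inputs -> 2 hidden neurons (ReLU) -> 1 output (sigmoid).
    hidden_w = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # one row of weights per hidden neuron
    hidden_b = [0.0, 0.1]
    output_w = [[1.0, -1.0]]
    output_b = [0.2]
    hidden = dense(x, hidden_w, hidden_b, relu)          # input layer -> hidden layer
    return dense(hidden, output_w, output_b, sigmoid)    # hidden layer -> output layer

prediction = forward([1.0, 2.0, 3.0])
print(prediction)
```

Data moves strictly from the input to the output, with no loops, which is what makes the network "feed forward".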
➢ Explain multilayer perceptron with diagram.

A Multilayer Perceptron (MLP) is a type of feedforward neural network
architecture that consists of an input layer, one or more hidden layers, and an
output layer. Each layer contains interconnected neurons, which are the basic
processing units that compute weighted sums of their inputs and apply an
activation function to produce an output. MLPs are widely used for various
machine learning tasks, including regression and classification.

Structure and Working of MLP:

​ Input Layer:
● The input layer serves as the initial point of the network and
represents the input data features.
● Each neuron in the input layer corresponds to a specific feature
of the input data. For instance, in image analysis, each pixel's
intensity could be a feature.
● The number of neurons in the input layer is determined by the
dimensionality of the input data.
​ Hidden Layers:
● Hidden layers are intermediary layers between the input and
output layers. They allow the network to capture complex
relationships within the data.
● Neurons in the hidden layers process the information received
from the previous layer and apply weighted sums of inputs.
● Each neuron's output is then passed through an activation
function. Common activation functions include the Rectified
Linear Unit (ReLU) and the hyperbolic tangent (tanh) function.
● The number of hidden layers and the number of neurons in each
layer are hyperparameters that can be adjusted based on the
complexity of the problem and the available data.
​ Output Layer:
● The output layer produces the final results of the network's
computations.
● The number of neurons in the output layer depends on the
specific task. For example, in binary classification, a single
neuron giving the probability of the positive class is usually
enough (one neuron per class is also possible). In multi-class
classification, there would be a neuron for each class.
● The activation function used in the output layer depends on the
nature of the task. For binary classification, a sigmoid function is
commonly used. For multi-class classification, a softmax
function is often employed.

In the diagram, each layer of the Multilayer Perceptron is visually
represented along with the interconnections between neurons. Here's a
breakdown of the diagram:

● Input Layer: The input layer consists of neurons representing the
input features. In the diagram, each box at the top represents a
feature, and each arrow going from the input feature to the
neurons in the hidden layer indicates the input provided by that
feature to each neuron.
● Hidden Layers: The hidden layers are depicted in the middle
section of the diagram. Neurons in the hidden layers take inputs
from the previous layer's neurons (including the input layer) and
process them to produce an output. The connections between
neurons are shown as arrows, and each arrow has a weight
associated with it.
● Output Layer: The output layer is shown at the bottom of the
diagram. Neurons in the output layer take inputs from the last
hidden layer and generate the final output of the network. The
arrows leading from the hidden layer neurons to the output layer
neurons indicate the weighted inputs to each output neuron.
➢ What are the components of a Neural Network? Explain with diagram
A neural network consists of several key components that work
together to process input data, learn from it, and produce desired outputs.
These components include the input layer, hidden layers, weights, biases,
activation functions, and the output layer.

Components of a Neural Network:

​ Input Layer:
● The input layer is the initial part of the neural network where raw
data is fed into the system.
● Each neuron in the input layer corresponds to a feature or
attribute of the input data. For example, in an image
classification task, each pixel's intensity could be a feature.
● The number of neurons in the input layer is determined by the
dimensionality of the input data. If you have N features, you'll
have N neurons in the input layer.
​ Hidden Layers:
● Hidden layers are situated between the input and output layers.
They perform the computational work of transforming the input
data into meaningful representations.
● Neurons within hidden layers apply weighted sums of their
inputs from the previous layer and then pass the result through
an activation function.
● Each hidden layer captures progressively abstract features as
you move deeper into the network. The number of hidden layers
and neurons per layer are design choices that impact the
network's capacity to learn complex patterns.
​ Weights and Biases:
● Weights represent the strengths of connections between
neurons in different layers. Each connection between neurons
has an associated weight.
● Biases are added to the weighted sum before passing through
the activation function. They allow the network to shift the
output values.
● Both weights and biases are learned during the training process,
aiming to optimize the network's performance on the given task.
​ Activation Functions:
● Activation functions introduce non-linearity to the neural
network. Without them, the network would behave like a linear
model, unable to capture complex relationships in the data.
● Common activation functions include:
● ReLU (Rectified Linear Unit): Applies zero for negative
inputs and the input itself for positive inputs.
● Sigmoid: Squeezes inputs into a range between 0 and 1.
● Tanh (Hyperbolic Tangent): Squeezes inputs into a range
between -1 and 1.
● Softmax: Used in the output layer of multi-class
classification to convert raw scores into probabilities.
​ Output Layer:
● The output layer produces the final results of the neural
network's computation.
● The number of neurons in the output layer depends on the task.
For example, for binary classification, a single neuron giving
the probability of the positive class is usually enough. For
multi-class classification, there would be a neuron for each
class.
● The activation function in the output layer varies depending on
the task. For instance, in binary classification, a sigmoid
function is typically used. In multi-class classification, a softmax
function is commonly applied.

Detailed Diagram Explanation:

In the labeled diagram provided earlier, each component of the neural network
is visually represented:

➔ The input layer consists of neurons representing individual input
features.
➔ Hidden layers, denoted as "Hidden Neuron," process and transform
data through weighted connections and activation functions.
➔ Arrows between neurons represent weighted connections. Each arrow
has a corresponding weight.
➔ Bias terms are omitted in the diagram for simplicity.
➔ The output layer generates final predictions or classifications.

How do artificial neural networks work?

An Artificial Neural Network is best represented as a weighted directed
graph, where the artificial neurons form the nodes. The connections from
neuron outputs to neuron inputs are the directed edges, each carrying a
weight. The network receives its input signal from an external source,
such as a pattern or an image, in the form of a vector. The inputs are
denoted x(n), one for each of the n inputs.

Each input is then multiplied by its corresponding weight (the weights are
the information the network uses to solve a specific problem). In general
terms, a weight represents the strength of the interconnection between
neurons inside the network. All the weighted inputs are summed inside the
computing unit.

A bias is then added to the weighted sum. The bias shifts the result, so the
neuron can produce a non-zero output even when the weighted sum is zero. It
can be viewed as an extra input fixed at 1 with its own learnable weight. The
total of the weighted inputs can in principle range from negative infinity to
positive infinity. To keep the response within the limits of the desired
value, the total is passed through the activation function.

The activation function is the transfer function used to achieve the desired
output. There are several kinds of activation function, which fall primarily
into linear and non-linear sets of functions.
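
The computation performed by a single artificial neuron can be sketched as follows. This is a minimal sketch: the function name `neuron_output`, the example inputs and weights, and the choice of a step function as the non-linear activation are assumptions made only for illustration.

```python
def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)

def identity(z):
    # A linear activation: the output is the weighted sum itself.
    return z

def step(z):
    # A simple non-linear (threshold) activation, chosen for illustration.
    return 1 if z >= 0 else 0

x = [1.0, 0.0, 1.0]      # example input vector x(1)..x(3)
w = [0.5, -0.5, -1.0]    # example connection weights
linear_out = neuron_output(x, w, bias=0.25, activation=identity)
step_out = neuron_output(x, w, bias=0.25, activation=step)
print(linear_out, step_out)
```

The same weighted sum gives different outputs depending on the activation function: the linear version passes the value through unchanged, while the step version only reports whether the threshold was reached.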
➢ Explain the following terms denoting their notations and equations
(where necessary) with respect to deep neural networks: (Any 5)
○ Connection weights and Biases
○ Epoch
○ Layers and Parameters
○ Activation Functions
○ Loss/Cost Functions
○ Learning rate
○ Sample and batch
○ Hyperparameters

1. Connection weights and Biases


In a deep neural network, connection weights and biases are the
parameters that are learned during the training process. They control how the
network learns to map from its inputs to its outputs.
● Connection weights are the real numbers that are associated with each
connection between two neurons. They represent the strength of the
connection between the two neurons. A higher weight means that the
connection is stronger and will have a greater impact on the output of
the neuron.
● Biases are real numbers that are added to the weighted sum of each
neuron's inputs before it is passed to the activation function. They
shift the activation threshold, which lets the neuron fit the data better.
● The weights and biases are updated during the training process using a
gradient descent algorithm. The goal of the gradient descent algorithm is to
find the weights and biases that minimize the error between the network's
predictions and the desired outputs.
Connection weights and biases are essential for the learning process in deep
neural networks. They allow the network to learn the relationships between its
inputs and outputs. By adjusting the weights and biases, the network can
learn to make more accurate predictions.

Here are some additional things to keep in mind about connection weights
and biases:

● The weights are typically initialized randomly at the beginning of the
training process (biases are often initialized to zero).
● The weights of a layer are typically represented as a matrix, and its
biases as a vector.
● The number of weights and biases in a neural network can be very
large.
● The weights and biases together determine the performance of the neural
network.
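
Since the question asks for notation and equations, one common way to write these quantities is shown here. The symbols are an assumption, because the notes do not fix any: w_{ji}^{(l)} is the weight from neuron i in layer l-1 to neuron j in layer l, b_j^{(l)} is the bias of neuron j, f is the activation function, α is the learning rate, and \mathcal{L} is the loss.

```latex
\begin{align}
z_j^{(l)} &= \sum_{i} w_{ji}^{(l)} \, a_i^{(l-1)} + b_j^{(l)}, &
a_j^{(l)} &= f\!\left(z_j^{(l)}\right) \\
w_{ji}^{(l)} &\leftarrow w_{ji}^{(l)} - \alpha \, \frac{\partial \mathcal{L}}{\partial w_{ji}^{(l)}}, &
b_j^{(l)} &\leftarrow b_j^{(l)} - \alpha \, \frac{\partial \mathcal{L}}{\partial b_j^{(l)}}
\end{align}
```

The first line is the forward computation of one neuron. The second line is the gradient descent update applied to every weight and bias during training.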

2. Epoch

3. Layers and Parameters


4. Activation Functions
5. Loss/Cost Functions

Loss functions measure the discrepancy between the predicted values of the
model and the actual ground truth values. The goal during training is to
minimize the value of the loss function, which essentially means making the
network's predictions as close to the actual values as possible.

Notation:

● A generic loss function is often denoted as L or loss.


● For a specific type of loss function, such as mean squared error, the
notation could be MSE(y_true, y_pred), where y_true represents the
true (ground truth) values and y_pred represents the predicted values.
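
As a small worked example of the MSE notation above, the sketch below computes the mean squared error for three samples. The function name `mse` and the numbers are illustrative assumptions.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.5, 2.0]
loss = mse(y_true, y_pred)   # (0^2 + 0.5^2 + 1^2) / 3
print(loss)
```

A perfect prediction gives a loss of zero, and larger errors are penalized more heavily because the differences are squared.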
6. Learning rate

The learning rate is a hyperparameter that determines the step size at which
the model's parameters (weights and biases) are updated during the
optimization process, such as gradient descent. It's a crucial parameter as it
controls the rate of convergence during training and affects how quickly the
model learns from the data.

Notation: The learning rate is often denoted as α or learning_rate.
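
The effect of the learning rate can be seen on a toy problem. The sketch below minimizes the assumed loss L(w) = (w - 3)^2 with the update w ← w - α · dL/dw. The loss, the starting point, and the three values of α are chosen only for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=0.0):
    """Minimize the toy loss L(w) = (w - 3)^2 by plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)           # dL/dw
        w = w - learning_rate * grad     # step against the gradient
    return w

w_small = gradient_descent(0.01)   # too small: still far from the minimum at w = 3
w_good = gradient_descent(0.1)     # reasonable: close to 3 after 20 steps
w_large = gradient_descent(1.1)    # too large: each step overshoots and the error grows
print(w_small, w_good, w_large)
```

With the same number of steps, the small rate converges slowly, the moderate rate gets close to the minimum, and the large rate diverges.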

7. Sample and batch

In the training process of deep neural networks, data is organized into
samples and processed in batches. These concepts are crucial for efficient
and effective training.

Notation:

● A single data point (input-output pair) is often denoted as (x, y) or
simply referred to as a sample.
● A collection of multiple samples is known as a batch.

Samples:

● A sample represents a single input data point (x) and its corresponding
target/output data (y).
● For example, in image classification, a sample could be an image and
its corresponding label.
● In mathematical notation, a single sample can be represented as (x_i,
y_i) where i is the index of the sample.

Batches:

● A batch is a collection of multiple samples grouped together.


● Batching allows training to be performed on subsets of the entire
dataset.
● A batch typically contains n samples, where n is the batch size.
● Batch processing improves memory efficiency and allows for
parallelization on hardware.
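
The relationship between samples and batches can be sketched as follows. The helper name `make_batches` and the toy dataset of ten (x, y) pairs are assumptions made for the example.

```python
def make_batches(samples, batch_size):
    """Split a list of (x, y) samples into consecutive batches."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

# Ten toy samples (x_i, y_i) with y = 2x.
samples = [(x, 2 * x) for x in range(10)]
batches = make_batches(samples, batch_size=4)

for batch in batches:
    print(len(batch), batch)
```

With 10 samples and a batch size of 4, one pass over the data takes three iterations. The last batch is smaller because the dataset size is not a multiple of the batch size.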
8. Hyperparameters

Hyperparameters are settings and values that are set before training a deep
neural network. These parameters are not learned during the training process;
instead, they are chosen by the developer or researcher and significantly
influence the behavior, capacity, and generalization of the neural network.

Notation:

● There isn't a specific universal notation for hyperparameters; they are
often named individually based on their roles (e.g., batch_size,
learning_rate, hidden_units).

Impact on the Network: Hyperparameters control various aspects of
the network's behavior, including its capacity, convergence rate, and
ability to avoid overfitting. Some key hyperparameters include:

​ Learning Rate (α): The step size at which the model's parameters are
updated during optimization. Too high a learning rate can lead to
divergence, while too low a learning rate can lead to slow convergence.
​ Batch Size (n): The number of samples in each training batch. A larger
batch size can lead to smoother parameter updates and more efficient
computation but might require more memory.
​ Number of Hidden Units/Layers: The depth and width of the network.
More hidden units and layers can increase the network's capacity to
learn complex patterns, but may also lead to overfitting.
​ Activation Functions: The non-linear functions applied to neuron
outputs. Different activation functions affect how information flows
through the network.
​ Regularization Strength: Hyperparameters like L1 and L2 regularization
control the prevention of overfitting by adding penalties to the loss
function based on the magnitudes of parameters.
​ Dropout Rate: The probability of dropping neurons during training to
reduce overfitting. It introduces randomness by randomly setting some
neurons to zero during each iteration.
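
To make the dropout rate concrete, here is a minimal sketch of dropout applied to one layer's activations. It assumes the common "inverted dropout" variant, in which the surviving activations are scaled by 1 / (1 - rate). That scaling, the function name `dropout`, and the seeded generator are assumptions not stated in the notes.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)                  # fixed seed so the run is repeatable
hidden = [1.0] * 1000                   # pretend outputs of 1000 hidden neurons
dropped = dropout(hidden, rate=0.5, rng=rng)
fraction_zeroed = dropped.count(0.0) / len(dropped)
print(fraction_zeroed)
```

Roughly half of the activations are set to zero on each call. Dropout is applied only during training and is switched off when the network makes predictions.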

➢ Explain backpropagation with diagram


Backpropagation is a widely used algorithm for training feedforward neural
networks. It computes the gradient of the loss function with respect to the
network weights, and does so far more efficiently than naively computing the
gradient with respect to each weight separately. This efficiency makes it
possible to use gradient methods to train multi-layer networks and update
weights to minimize loss; gradient descent and variants such as stochastic
gradient descent are often used.
The backpropagation algorithm works by computing the gradient of the loss
function with respect to each weight via the chain rule, computing the gradient
layer by layer, and iterating backward from the last layer to avoid redundant
computation of intermediate terms in the chain rule.
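
The chain-rule recursion can be written compactly in one common notation. The symbols are assumptions, since the notes do not define any: z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} is the weighted input of layer l, a^{(l)} = f(z^{(l)}) is its activation, \mathcal{L} is the loss, L is the index of the output layer, and δ^{(l)} is the gradient of the loss with respect to z^{(l)}.

```latex
\begin{align}
\delta^{(L)} &= \nabla_{a^{(L)}} \mathcal{L} \odot f'\!\left(z^{(L)}\right) \\
\delta^{(l)} &= \left( \left(W^{(l+1)}\right)^{\top} \delta^{(l+1)} \right) \odot f'\!\left(z^{(l)}\right) \\
\frac{\partial \mathcal{L}}{\partial W^{(l)}} &= \delta^{(l)} \left(a^{(l-1)}\right)^{\top}, \qquad
\frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}
\end{align}
```

Each δ^{(l)} is computed once from δ^{(l+1)} and then reused for every weight in that layer, which is where the saving over a weight-by-weight computation comes from.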

Working of Backpropagation:

Neural networks trained with supervised learning generate output vectors from
the input vectors they are given. The network compares the generated output to
the desired output and computes an error if the two do not match. It then
adjusts the weights according to this error to move closer to the desired
output.

Backpropagation Algorithm:

Step 1: Inputs X arrive through the preconnected path.
Step 2: The input is modeled using real-valued weights W. The weights are
usually initialized randomly.
Step 3: Calculate the output of each neuron from the input layer to the hidden
layer to the output layer.
Step 4: Calculate the error in the outputs:
Backpropagation Error = Actual Output – Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the
weights to reduce the error.
Step 6: Repeat the process until the desired output is achieved.
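
The six steps can be traced on the smallest possible network: one input, one sigmoid hidden neuron, and one linear output neuron. This is a sketch under stated assumptions. The network shape, the fixed starting weights (used instead of random ones so the run is repeatable), the squared-error loss, and the learning rate are all chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(x, target, steps=200, lr=0.5):
    # Step 2: starting weights (fixed here, normally random).
    w1, b1, w2, b2 = 0.4, 0.0, -0.3, 0.0
    losses = []
    for _ in range(steps):                   # Step 6: repeat
        # Step 3: forward pass, input -> hidden -> output.
        h = sigmoid(w1 * x + b1)
        y = w2 * h + b2
        # Step 4: error = actual output - desired output.
        error = y - target
        losses.append(0.5 * error ** 2)
        # Step 5: backward pass via the chain rule, output layer first.
        grad_w2 = error * h
        grad_b2 = error
        grad_z = error * w2 * h * (1.0 - h)  # through the sigmoid derivative
        grad_w1 = grad_z * x
        grad_b1 = grad_z
        # Adjust every parameter against its gradient.
        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    return losses

losses = train(x=1.0, target=0.8)
print(losses[0], losses[-1])
```

The loss recorded at the last step is far smaller than at the first, which is the behavior Step 6 relies on.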

Types of Backpropagation
There are two types of backpropagation networks.
● Static backpropagation: Static backpropagation is a network designed
to map static inputs to static outputs. These types of networks are
capable of solving static classification problems such as OCR (Optical
Character Recognition).
● Recurrent backpropagation: Recurrent backpropagation is another
network, used for fixed-point learning. Activation in recurrent
backpropagation is fed forward until a fixed value is reached. Static
backpropagation provides an instant mapping, while recurrent
backpropagation does not provide an instant mapping.

➢ Explain Activation Functions with diagram and the properties they must
hold in a neural network model.

(Diagram)
Activation functions determine whether a neuron should be activated (fire) or
not based on the weighted sum of its inputs. They introduce non-linearity to
the network, enabling it to model and learn complex patterns in the data.

Properties of Activation Functions:

Activation functions should ideally possess certain properties to ensure the
effectiveness of neural network training and performance. The primary
properties include:

1. Non-linearity: Activation functions must be non-linear to enable neural
networks to learn complex relationships. A composition of linear
functions remains linear, which limits the network's expressive power.
With non-linear activation functions and enough hidden units, a
network can approximate a very wide class of functions.
2. Continuity: Activation functions should be continuous, as this helps with
gradient-based optimization methods like gradient descent.
Discontinuous functions may lead to challenges in calculating gradients
and optimizing the network.
3. Monotonicity: Monotonic activation functions (functions that never
decrease, or never increase, across their whole domain) help in
propagating gradients consistently during backpropagation. This
facilitates learning and optimization.
4. Boundedness: Activation functions that have bounded output ranges
can prevent activation values from becoming too large and leading to
numerical instability. It can also help with stable learning and
convergence.
5. Differentiability: Activation functions must be differentiable almost
everywhere to enable gradient-based optimization algorithms like
backpropagation. This allows efficient calculation of gradients for
parameter updates.

Common Activation Functions:

➢ Sigmoid Function:
○ Formula: σ(x) = 1 / (1 + exp(-x))
○ Range: (0, 1)
○ S-shaped curve, used in the past for binary classification. Can
suffer from vanishing gradient problem for very high or low
inputs.
➢ Hyperbolic Tangent (Tanh) Function:
○ Formula: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
○ Range: (-1, 1)
○ Similar to the sigmoid but zero-centered, which tends to make
optimization easier. It still saturates for inputs of large
magnitude, so it reduces the vanishing gradient problem only to
some extent.
➢ Rectified Linear Unit (ReLU):
○ Formula: ReLU(x) = max(0, x)
○ Range: [0, ∞)
○ Widely used due to its simplicity and effectiveness. Addresses
vanishing gradient and helps with training deeper networks.
➢ Leaky ReLU:
○ Formula: LeakyReLU(x) = x if x > 0, else LeakyReLU(x) = α * x
where α is a small positive constant.
○ Range: (-∞, ∞)
○ Addresses the "dying ReLU" problem by allowing small
gradients for negative inputs.
➢ Parametric ReLU (PReLU):
○ Similar to Leaky ReLU but α is learned during training instead of
being a fixed constant.
➢ Exponential Linear Unit (ELU):
○ Formula: ELU(x) = x if x > 0, else ELU(x) = α * (exp(x) - 1) where
α is a positive constant.
○ Smooth transition around zero, which helps mitigate vanishing
gradient and allows negative values without loss of information.
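
The formulas listed above translate directly into code. The sketch below uses only Python's `math` module. The default values α = 0.01 for Leaky ReLU and α = 1.0 for ELU are common choices assumed for illustration, not values given in the notes.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Evaluate every function at a negative, a zero, and a positive input.
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x), elu(x))
```

For the negative input, ReLU returns exactly zero, while Leaky ReLU and ELU return small negative values. That difference is the basis of the "dying ReLU" discussion.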

8. Compare activation functions:
○ ReLU
○ Leaky ReLU
○ ELU

ReLU (Rectified Linear Unit):

● Formula: ReLU(x) = max(0, x)
● Range: [0, ∞)
● Pros:
● Simple and computationally efficient.
● Mitigates the vanishing gradient problem because its gradient is
exactly 1 for positive inputs and does not saturate.
● Promotes sparsity, where only some neurons activate.
● Cons:
● Can suffer from the "dying ReLU" problem, where neurons get
stuck in a non-activating state during training (gradients become
zero for negative inputs).

Leaky ReLU:
● Formula: LeakyReLU(x) = x if x > 0, else LeakyReLU(x) = α * x where α
is a small positive constant.
● Range: (-∞, ∞)
● Pros:
● Addresses the "dying ReLU" problem by allowing small
gradients for negative inputs.
● Keeps ReLU's simplicity and low cost. Outputs for negative inputs
are small rather than exactly zero, so activations are not strictly
sparse.
● Cons:
● Introduces a new hyperparameter α which needs to be tuned.
● May not perform as well as other activations for some tasks.

ELU (Exponential Linear Unit):

● Formula: ELU(x) = x if x > 0, else ELU(x) = α * (exp(x) - 1) where α is a
positive constant.
● Range: (-α, ∞)
● Pros:
● Smooth transition around zero, which helps mitigate vanishing
gradient and allows negative values without loss of information.
● Eliminates the "dying ReLU" problem as it has non-zero
gradients for all inputs.
● Better generalization performance on some tasks compared to
ReLU and Leaky ReLU.
● Cons:
● Slightly computationally more expensive due to the exponential
operation.

Comparison:

​ Advantages:
● ReLU: Simple, computationally efficient, addresses vanishing
gradient for positive inputs.
● Leaky ReLU: Addresses dying ReLU, keeps ReLU's simplicity.
● ELU: Addresses dying ReLU, provides smooth transition, better
generalization.
​ Disadvantages:
● ReLU: Can suffer from dying ReLU, not well-suited for negative
inputs.
● Leaky ReLU: Introduces a hyperparameter, may not perform as
well for all tasks.
● ELU: Slightly more computationally expensive.
​ Dying Neurons:
● ReLU: Prone to dying neurons for negative inputs.
● Leaky ReLU: Less prone to dying neurons due to non-zero
gradient for negative inputs.
● ELU: Non-zero gradient for all inputs, effectively eliminates
dying neurons.
​ Smoothness:
● ReLU: Not smooth at zero.
● Leaky ReLU: Not smooth at zero, but continuous.
● ELU: Smooth at zero when α = 1; for other values of α it is
continuous but has a kink at zero.
​ Sparsity:
● ReLU: Promotes sparsity due to zero activation for negative
inputs.
● Leaky ReLU: Outputs small non-zero values for negative inputs, so
it is less sparse than ReLU.
● ELU: Less sparsity-promoting due to non-zero values for
negative inputs.
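
The "dying neuron" comparison comes down to the gradient each function passes back for negative inputs. The sketch below shows those gradients. It assumes α = 0.01 for Leaky ReLU and α = 1.0 for ELU, and it uses the convention that the gradient at exactly x = 0 takes the negative-side value.

```python
import math

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

def elu_grad(x, alpha=1.0):
    # Derivative of alpha * (exp(x) - 1) is alpha * exp(x) for x <= 0.
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-3.0, 2.0):
    print(x, relu_grad(x), leaky_relu_grad(x), elu_grad(x))
```

At x = -3 the ReLU gradient is exactly zero, so no update flows through that neuron. The Leaky ReLU and ELU gradients are small but non-zero, so the neuron can still recover.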

9. What is a hyperparameter? Describe the categories of hyperparameters.


A hyperparameter is a parameter that is not learned by the neural network
during training but is set before the training process begins. These
parameters influence the behavior and performance of the neural network and
play a critical role in determining how the network learns from the data.

Hyperparameters are often set by the developer or researcher based on
domain knowledge, trial-and-error experimentation, or automated techniques
like hyperparameter tuning. Choosing appropriate hyperparameter values can
significantly impact the convergence speed, model performance, and
generalization of the neural network.

Categories of Hyperparameters:

Hyperparameters can be broadly categorized into different groups based on
their roles and the aspects of the neural network they affect:

​ Model Architecture Hyperparameters:


● These hyperparameters define the fundamental structure of the
neural network.
● Examples:
● Number of layers: The depth of the network, including
input, hidden, and output layers.
● Number of neurons in each layer: Determines the
capacity and complexity of the network.
● Type of layers: Choices like fully connected,
convolutional, recurrent, etc.
​ Training Hyperparameters:
● These hyperparameters affect the training process and
optimization of the neural network.
● Examples:
● Learning rate (α): Controls the step size in gradient
descent optimization.
● Batch size: The number of samples used in each iteration
during training.
● Number of epochs: The number of times the entire
training dataset is processed during training.
● Optimization algorithm: Specifies how parameter updates
are performed (e.g., SGD, Adam).
​ Regularization Hyperparameters:
● Regularization techniques prevent overfitting by adding
penalties to the loss function.
● Examples:
● L1 and L2 regularization strengths: Control the amount of
penalty applied to the weights.
● Dropout rate: Probability of dropping neurons during
training to prevent overfitting.
​ Initialization Hyperparameters:
● Initialization techniques set the initial values of the weights and
biases in the neural network.
● Examples:
● Weight initialization schemes: Choices like random,
Xavier/Glorot, He initialization.
● Bias initialization: Setting biases to zero or small
positive/negative values.
​ Activation Function Hyperparameters:
● Activation functions introduce non-linearity to the network and
influence the flow of information.
● Examples:
● Type of activation function: Choices like ReLU, sigmoid,
tanh, etc.
● Parameters of the activation functions: For instance, the
slope of Leaky ReLU.
​ Learning Rate Schedule Hyperparameters:
● These hyperparameters control how the learning rate changes
during training.
● Examples:
● Learning rate decay: Gradually decreasing the learning
rate over epochs.
● Learning rate annealing: Adjusting the learning rate
based on a pre-defined schedule.
​ Batch Normalization and Regularization Hyperparameters:
● These hyperparameters are specific to techniques like batch
normalization and dropout.
● Examples:
● Batch normalization hyperparameters: Momentum (used for the
running mean and variance estimates) and epsilon (a small
constant added to the variance for numerical stability).
● Drop rate for dropout: Controlling the probability of
dropping neurons.
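
As an illustration, the categories above could be collected in a single configuration. This is only a sketch: every key name and value below is a hypothetical choice made for the example, not a recommended setting, and the helper shows one possible learning rate decay rule (exponential decay).

```python
# Hypothetical grouping of hyperparameters by the categories listed above.
# All names and values are illustrative assumptions, not recommendations.
hyperparameters = {
    "architecture": {"num_hidden_layers": 3, "units_per_layer": 128, "layer_type": "dense"},
    "training": {"learning_rate": 0.01, "batch_size": 32, "epochs": 20, "optimizer": "adam"},
    "regularization": {"l1_strength": 0.0, "l2_strength": 1e-4, "dropout_rate": 0.5},
    "initialization": {"weights": "he", "bias": 0.0},
    "activation": {"type": "leaky_relu", "negative_slope": 0.01},
    "lr_schedule": {"type": "exponential_decay", "decay_rate": 0.95},
    "batch_norm": {"momentum": 0.9, "epsilon": 1e-5},
}


def decayed_learning_rate(initial_lr, decay_rate, epoch):
    """Exponential decay: lr(epoch) = initial_lr * decay_rate ** epoch."""
    return initial_lr * decay_rate ** epoch


lr0 = hyperparameters["training"]["learning_rate"]
rate = hyperparameters["lr_schedule"]["decay_rate"]
# Learning rate for the first three epochs under the assumed schedule.
schedule = [decayed_learning_rate(lr0, rate, epoch) for epoch in range(3)]
```

None of these values is learned from data; each one is fixed before training starts, which is what distinguishes it from a weight or bias.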

10. What is the vanishing gradient problem? How can the problem be identified?
List the solutions to avoid it.
The vanishing gradient problem is a challenge that occurs during the training
of deep neural networks, particularly networks with many layers. It refers to
the situation where the gradients of the loss function with respect to the
network's parameters become very small as they are propagated backward
through the layers during the training process. This leads to slow
convergence, makes it difficult to update the parameters of earlier layers,
and impedes the learning process.
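
A small numerical sketch can make the effect concrete. It assumes a simplified chain of sigmoid layers in which each layer multiplies the backpropagated gradient by weight × sigmoid'(z); the function names, the weight of 1.0, and the pre-activation of 0.0 (where the sigmoid derivative is at its maximum of 0.25) are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop_factor(num_layers, pre_activation=0.0, weight=1.0):
    """Product of the per-layer factors weight * sigmoid'(z) along one path."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= weight * sigmoid_derivative(pre_activation)
    return factor


# Gradient scaling factor reaching the first layer for networks of different depth.
factors = {depth: backprop_factor(depth) for depth in (1, 5, 10, 20)}
```

Even in this best case for the sigmoid, the factor shrinks as 0.25 raised to the number of layers, so a 10-layer chain already scales the gradient by less than one millionth.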

Identifying the Vanishing Gradient Problem:

You can identify the vanishing gradient problem through the following
observations during training:

​ Slow Convergence: The loss function decreases very slowly during
training, making the learning process time-consuming.
​ Small Gradient Magnitudes: The gradients of the early layers (closer to
the input) become extremely small, indicating that the network
struggles to learn from the data effectively.
​ Stagnant or Oscillating Training: The training process may appear to
stagnate, and the network may struggle to generalize to the validation
or test data.
​ Ineffective Learning in Early Layers: The weights of early layers hardly
change, implying that they are not being updated effectively.
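
One way to check the "small gradient magnitudes" symptom is to log the gradient norm of each layer and flag those below a threshold. The sketch below assumes the gradients have already been collected into a dictionary; the layer names, gradient values, and the threshold of 1e-6 are hypothetical.

```python
import math


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def find_vanishing_layers(layer_gradients, threshold=1e-6):
    """Return the names of layers whose gradient L2 norm is below the threshold."""
    return [
        name
        for name, grads in layer_gradients.items()
        if l2_norm(grads) < threshold
    ]


# Hypothetical gradients recorded after one backward pass (earliest layer first).
example_gradients = {
    "layer_1": [1e-9, -2e-9, 5e-10],
    "layer_2": [3e-5, -1e-5, 2e-5],
    "layer_3": [0.02, -0.01, 0.03],
}
flagged = find_vanishing_layers(example_gradients)
```

A suitable threshold depends on the model and loss scale, so in practice it is more informative to compare the norms of early and late layers over time than to rely on a fixed cutoff.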

Solutions to Avoid the Vanishing Gradient Problem:

Several techniques have been developed to mitigate the vanishing gradient
problem and enable the successful training of deep neural networks:

​ Activation Functions:
● Avoid saturating activation functions such as sigmoid or tanh in
deep hidden layers; their derivatives are at most 0.25 and 1
respectively and approach zero for large positive or negative
inputs. Instead, use activation functions like ReLU, Leaky ReLU,
or ELU, whose derivative is 1 for positive inputs, so gradients
flow more freely.
​ Batch Normalization:
● Batch normalization normalizes the activations in each layer,
helping in maintaining reasonable gradient magnitudes
throughout the network. It can significantly mitigate the
vanishing gradient problem.
​ Skip Connections and Residual Networks (ResNets):
● Skip connections, introduced in ResNets, allow information to
bypass one or more layers. This helps gradients flow more
directly and can alleviate the vanishing gradient problem.
​ Gradient Clipping:
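● Gradient clipping caps the gradient norm (or individual gradient
values) at a threshold during training. It primarily addresses
the related exploding gradient problem, where gradients become
too large; it does not enlarge gradients that are already too
small, but it helps keep updates stable in deep and recurrent
networks.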
​ Weight Initialization:
● Use appropriate weight initialization techniques, such as He
initialization, which can help in balancing the scale of activations
and gradients.
​ Use Shallow Networks:
● If you suspect the vanishing gradient problem is severe,
consider using shallower networks or architectures with fewer
layers to reduce the depth-related issues.
​ Learning Rate Scheduling:
● Adjusting the learning rate during training (for example, a
gradual decay) does not enlarge small gradients, but it can
stabilize optimization so that the early layers continue to
receive useful updates.
​ Rescale Inputs:
● Rescale input features to have approximately zero mean and unit
variance. This keeps pre-activations away from the saturated
regions of sigmoid or tanh units, where gradients are close to
zero.
​ Use LSTM and GRU Architectures:
● If you're dealing with sequential data, Long Short-Term Memory
(LSTM) and Gated Recurrent Unit (GRU) architectures are
designed to mitigate vanishing gradient problems in recurrent
networks.
​ Attention Mechanisms:
● Attention mechanisms, especially in natural language
processing tasks, can help the network focus on relevant parts
of the input sequence, reducing the impact of vanishing
gradients.
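
To illustrate the weight initialization point from the list above, the following sketch pushes a random input through a stack of fully connected ReLU layers and compares He initialization (standard deviation sqrt(2 / fan_in)) with an arbitrarily small standard deviation. The layer width, depth, seed, and function names are assumptions for the example, and only the forward signal is measured; the same scaling argument is usually applied to the backward pass.

```python
import math
import random


def relu(x):
    return x if x > 0.0 else 0.0


def forward_rms(num_layers, width, weight_std, seed=0):
    """Root-mean-square activation after num_layers dense ReLU layers."""
    rng = random.Random(seed)
    activations = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(num_layers):
        # Each output unit gets its own freshly sampled weight vector.
        activations = [
            relu(sum(rng.gauss(0.0, weight_std) * a for a in activations))
            for _ in range(width)
        ]
    return math.sqrt(sum(a * a for a in activations) / width)


width = 64
he_std = math.sqrt(2.0 / width)  # He initialization for ReLU layers
rms_he = forward_rms(num_layers=10, width=width, weight_std=he_std)
rms_small = forward_rms(num_layers=10, width=width, weight_std=0.01)
```

With the small standard deviation the signal shrinks at every layer and is almost zero after ten layers, while the He-scaled weights keep it at a comparable order of magnitude to the input.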
11. What are optimization algorithms? Explain gradient descent. (Notes)
12. Explain the following optimization algorithms: (Any 2)
○ Gradient Descent
○ Stochastic Gradient Descent
○ Mini-Batch Gradient Descent

1. Gradient Descent:

Gradient Descent is an optimization algorithm used to minimize a loss
function by iteratively adjusting the model's parameters in the direction that
reduces the loss. It computes the gradient of the loss function with respect to
the parameters and updates the parameters by taking steps proportional to
the negative gradient.

Algorithm Steps:

1. Initialize the model's parameters.
2. Calculate the loss by comparing predicted values to actual targets.
3. Compute the gradient of the loss with respect to each parameter using
backpropagation.
4. Update each parameter by subtracting the learning rate (α) times its
gradient: θ ← θ − α · ∂L/∂θ.
5. Repeat steps 2 to 4 until the loss stops decreasing or a fixed number of
iterations is reached.
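
The steps above can be sketched for a very small problem. The example assumes a one-variable linear model y ≈ w·x + b trained with mean squared error, so the gradient is written out by hand rather than obtained through backpropagation; the data, learning rate, step count, and function name are illustrative assumptions.

```python
def batch_gradient_descent(xs, ys, learning_rate=0.05, num_steps=2000):
    """Full-batch gradient descent for y = w * x + b with mean squared error."""
    w, b = 0.0, 0.0  # step 1: initialize parameters
    n = len(xs)
    for _ in range(num_steps):
        # step 2: prediction errors over the ENTIRE dataset
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # step 3: gradient of the mean squared error with respect to w and b
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # step 4: move against the gradient, scaled by the learning rate
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = batch_gradient_descent(xs, ys)
```

Because this loss is convex and the learning rate is small enough, the parameters approach w = 2 and b = 1; every update, however, touches all samples, which is the cost that becomes prohibitive for large datasets.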

Advantages and Disadvantages:

● Advantages: Simple, widely used, can converge to a global minimum if
the loss function is convex and the learning rate is suitably small.
● Disadvantages: Computationally expensive for large datasets, can
converge slowly in areas with shallow gradients or saddle points.

2. Stochastic Gradient Descent (SGD):

Stochastic Gradient Descent is a variant of gradient descent that updates
parameters using the gradient of the loss function for a single randomly
chosen training sample at each iteration. This introduces randomness and
noise, but it makes each update much cheaper and can help the optimizer
escape local minima.

Algorithm Steps:

1. Initialize the model's parameters.
2. Randomly select a training sample.
3. Calculate the loss and gradient for the selected sample.
4. Update parameters using the gradient and learning rate.
5. Repeat steps 2 to 4 for multiple iterations.

Advantages and Disadvantages:

● Advantages: Faster updates, can escape local minima, handles large
datasets more efficiently.
● Disadvantages: Noisy updates due to randomness, can lead to
oscillations, might not converge well if learning rate is not well-tuned.
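
The SGD steps listed earlier can be sketched on a tiny example. The sketch assumes a one-variable linear model y ≈ w·x + b with squared error on a single sample; the data, learning rate, iteration count, seed, and function name are illustrative assumptions.

```python
import random


def stochastic_gradient_descent(xs, ys, learning_rate=0.01,
                                num_iterations=20000, seed=0):
    """SGD for y = w * x + b: one randomly chosen sample per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # step 1: initialize parameters
    for _ in range(num_iterations):
        i = rng.randrange(len(xs))          # step 2: pick one sample at random
        error = (w * xs[i] + b) - ys[i]     # step 3: loss gradient uses this sample only
        w -= learning_rate * 2.0 * error * xs[i]  # step 4: update parameters
        b -= learning_rate * 2.0 * error
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = stochastic_gradient_descent(xs, ys)
```

Each update costs the same regardless of dataset size. The data here is noise-free, so the estimates settle close to w = 2 and b = 1; with noisy data the parameters would keep fluctuating unless the learning rate is reduced over time.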

3. Mini-Batch Gradient Descent:

Mini-Batch Gradient Descent combines the benefits of both Gradient Descent
and Stochastic Gradient Descent by updating parameters using the average
gradient of a small batch of randomly chosen training samples. This approach
balances noise and computational efficiency.

Algorithm Steps:

1. Initialize the model's parameters.
2. Split the training dataset into mini-batches.
3. For each mini-batch, calculate the loss and average gradient.
4. Update parameters using the average gradient and learning rate.
5. Repeat steps 3 and 4 for multiple iterations over all mini-batches.

Advantages and Disadvantages:

● Advantages: Efficient and less noisy updates compared to SGD,
handles large datasets well.
● Disadvantages: May require tuning the batch size, still not guaranteed
to converge to a global minimum.
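
The mini-batch steps listed earlier can be sketched in the same setting. The example assumes a one-variable linear model y ≈ w·x + b with mean squared error, reshuffles the data every epoch, and uses a batch size of 2; all of these choices, as well as the function name, are illustrative assumptions.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size=2, learning_rate=0.02,
                               num_epochs=2000, seed=0):
    """Mini-batch gradient descent for y = w * x + b with mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # step 1: initialize parameters
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # step 2: new random split into mini-batches
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # step 3: average gradient over the current mini-batch
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            grad_w = (2.0 / len(batch)) * sum(e * xs[i] for e, i in zip(errors, batch))
            grad_b = (2.0 / len(batch)) * sum(errors)
            # step 4: update parameters with the averaged gradient
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = minibatch_gradient_descent(xs, ys)
```

Setting batch_size to 1 reduces this to SGD, and setting it to the dataset size reduces it to full-batch gradient descent, which is why the batch size is treated as a tunable hyperparameter.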

13. What is regularization? List the popular regularization techniques and
explain any two.
Regularization is a set of techniques used to prevent overfitting in machine
learning models, particularly in deep neural networks. Overfitting occurs when
a model learns to perform well on the training data but fails to generalize to
new, unseen data. Regularization methods aim to introduce additional
constraints or penalties to the training process, discouraging the model from
fitting noise or irrelevant patterns in the data.

Popular Regularization Techniques:

​ 1. L1 Regularization (Lasso):
● Adds a penalty term to the loss function proportional to the
absolute values of the model's parameters.
● Encourages sparsity in the parameter values, leading to some
parameters being exactly zero.
● Useful for feature selection, as it tends to set less important
features' weights to zero.
● Loss Function: L(θ) + λ * Σ|θ|, where L(θ) is the original loss, λ is
the regularization strength, and θ represents model parameters.
​ 2. L2 Regularization (Ridge):
● Adds a penalty term to the loss function proportional to the
squared values of the model's parameters.
● Encourages small parameter values, but does not force them to
exactly zero.
● Helps in preventing overfitting and stabilizing the training
process.
● Loss Function: L(θ) + λ * Σθ^2, where L(θ) is the original loss, λ
is the regularization strength, and θ represents model
parameters.
​ 3. Dropout:
● A regularization technique specific to neural networks.
● During training, randomly "drops out" (sets to zero) a proportion
of neurons in a layer with a specified probability.
● Forces the network to learn robust features that do not depend
heavily on any single neuron.
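● Implicitly trains an ensemble of sub-networks that share
weights, improving generalization.
● Not applied during inference; instead, the outputs (or
equivalently the outgoing weights) are scaled by the keep
probability, unless the retained activations were already
rescaled during training ("inverted dropout").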
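
The two penalty formulas above can be written directly as code. This is a minimal sketch in which the base loss is passed in as a plain number; the function names and the example parameter values are assumptions for illustration.

```python
def l1_penalty(params, strength):
    """lambda * sum(|theta|)"""
    return strength * sum(abs(p) for p in params)


def l2_penalty(params, strength):
    """lambda * sum(theta ** 2)"""
    return strength * sum(p * p for p in params)


def regularized_loss(base_loss, params, l1_strength=0.0, l2_strength=0.0):
    """Original loss L(theta) plus the selected penalty terms."""
    return (base_loss
            + l1_penalty(params, l1_strength)
            + l2_penalty(params, l2_strength))


params = [0.5, -2.0, 0.0]
loss_l1 = regularized_loss(1.0, params, l1_strength=0.1)  # 1.0 + 0.1 * (0.5 + 2.0 + 0.0)
loss_l2 = regularized_loss(1.0, params, l2_strength=0.1)  # 1.0 + 0.1 * (0.25 + 4.0 + 0.0)
```

The large parameter (-2.0) contributes 16 times more than the small one (0.5) under the L2 penalty but only 4 times more under L1, which reflects how L2 concentrates on large weights.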

Two Regularization Techniques Explained in Detail:

1. L1 Regularization (Lasso):

L1 regularization encourages sparsity in the model's parameters. This means
it forces some of the parameter values to be exactly zero, effectively
performing feature selection during training. The loss function is modified by
adding a term proportional to the sum of the absolute values of the
parameters.

Benefits:

● Feature selection: L1 regularization can automatically select relevant
features by setting irrelevant feature weights to zero.
● Simplicity: It can lead to simpler and more interpretable models by
reducing the number of non-zero parameters.

Challenges:

● Hyperparameter tuning: The value of the regularization strength λ
needs to be carefully chosen to balance the trade-off between fitting
the data and penalizing the parameters.

2. L2 Regularization (Ridge):

L2 regularization encourages small parameter values without forcing them to
be exactly zero. It adds a term proportional to the sum of the squared values
of the parameters to the loss function.

Benefits:

● Smoothness: L2 regularization can lead to smoother weight values,
which can help prevent large variations between parameter values.
● Improved generalization: By discouraging large parameter values, L2
regularization contributes to improved generalization on unseen data.

Challenges:

● Less feature selection: L2 regularization does not perform feature
selection as aggressively as L1 regularization. It tends to keep all
features in the model but with smaller magnitudes.
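
The different shrinkage behavior of the two penalties can be seen by applying only the penalty to a single weight, ignoring the data loss. The sketch below makes two assumptions worth noting: the L2 step is a plain gradient step on λ·w², while the L1 step uses the soft-thresholding (proximal) update that is commonly used for λ·|w| instead of a plain subgradient step, which would hover around zero without landing on it. Names and values are illustrative.

```python
def l2_shrink_step(weight, learning_rate, strength):
    """Gradient step on strength * w**2; the gradient is 2 * strength * w."""
    return weight - learning_rate * 2.0 * strength * weight


def l1_shrink_step(weight, learning_rate, strength):
    """Soft-thresholding (proximal) step for strength * |w|."""
    threshold = learning_rate * strength
    if weight > threshold:
        return weight - threshold
    if weight < -threshold:
        return weight + threshold
    return 0.0  # weights inside the threshold are set exactly to zero


w_l1 = w_l2 = 0.05
for _ in range(100):
    w_l1 = l1_shrink_step(w_l1, learning_rate=0.1, strength=0.1)
    w_l2 = l2_shrink_step(w_l2, learning_rate=0.1, strength=0.1)
```

Under L1 the weight decreases by a fixed amount per step and reaches exactly zero, whereas under L2 it is multiplied by a constant factor below one each step, so it keeps shrinking but stays non-zero.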

14. Compare:
a. Dropout and DropConnect
b. L1 and L2 regularization

Dropout vs. DropConnect:

1. Dropout:

● Dropout is a regularization technique introduced by Geoffrey Hinton
and his collaborators. It's primarily used in neural networks, especially
deep neural networks.
● During each training iteration, a fraction (dropout rate) of neurons in a
layer is randomly selected and "dropped out." This means their outputs
are set to zero for that iteration.
● In the common "inverted dropout" implementation, the remaining active
neurons' outputs are scaled by 1 / (1 − dropout rate), the inverse of
the keep probability, to compensate for the dropped-out neurons.
● This process creates a form of ensemble learning, as the network
trains on various sub-networks with different neuron configurations.
● During inference (testing or predictions), dropout is turned off. With
inverted dropout no further scaling is needed; in the original
formulation, outputs are instead multiplied by the keep probability
(1 − dropout rate) at inference to ensure consistent magnitudes.

2. DropConnect:

● DropConnect is a generalization of dropout that extends the idea to
connections (weights) between neurons, rather than applying dropout
to neurons themselves.
● Similar to dropout, during each training iteration, a fraction of the
weights in the network are randomly set to zero.
● The process of randomly setting weights to zero and then scaling the
retained weights encourages the network to learn from different
sub-networks in each iteration.
● DropConnect introduces randomness at the connection level, affecting
the network's learning process.

Comparison:

● Both Dropout and DropConnect introduce randomness to the network,
effectively creating an ensemble of sub-networks, leading to better
generalization and reduced overfitting.
● Dropout focuses on neurons, while DropConnect focuses on the
individual weights or connections.
● Dropout tends to influence the network's capacity more significantly
since it affects the number of active neurons, whereas DropConnect
primarily affects the strengths of connections.
● DropConnect can be more computationally expensive due to the need
to handle individual weights.
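
The difference in what gets masked can be sketched for a single dense layer in training mode. The code assumes inverted scaling by 1 / keep probability in both cases; for DropConnect this is a simplification chosen for the example, and the inference-time procedure is not shown. Function names, inputs, and weights are hypothetical.

```python
import random


def dropout_forward(inputs, weights, drop_rate, rng):
    """Dropout: zero whole output neurons, rescale the ones that are kept."""
    keep_prob = 1.0 - drop_rate
    outputs = []
    for row in weights:  # one row of weights per output neuron
        activation = sum(w * x for w, x in zip(row, inputs))
        keep = rng.random() < keep_prob  # one random decision per neuron
        outputs.append(activation / keep_prob if keep else 0.0)
    return outputs


def dropconnect_forward(inputs, weights, drop_rate, rng):
    """DropConnect: zero individual weights, then compute the layer output."""
    keep_prob = 1.0 - drop_rate
    outputs = []
    for row in weights:
        # one random decision per connection (weight)
        masked = [w if rng.random() < keep_prob else 0.0 for w in row]
        outputs.append(sum(w * x for w, x in zip(masked, inputs)) / keep_prob)
    return outputs


rng = random.Random(0)
inputs = [1.0, 2.0]
weights = [[1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
dropout_out = dropout_forward(inputs, weights, drop_rate=0.5, rng=rng)
dropconnect_out = dropconnect_forward(inputs, weights, drop_rate=0.5, rng=rng)
```

Dropout draws one random number per neuron, while DropConnect draws one per weight, which is the source of the extra computational cost mentioned in the comparison.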

L1 Regularization vs. L2 Regularization:

1. L1 Regularization (Lasso):

● L1 regularization, also known as Lasso (Least Absolute Shrinkage and
Selection Operator), involves adding a penalty term to the loss function
proportional to the sum of the absolute values of the model's
parameters.
● This encourages many parameter values to become exactly zero,
leading to sparse models where some features are effectively
excluded.
● L1 regularization is well-suited when you suspect that many features
are irrelevant or redundant.

2. L2 Regularization (Ridge):

● L2 regularization, also known as Ridge regularization, involves adding a
penalty term to the loss function proportional to the sum of the
squared values of the model's parameters.
● L2 regularization encourages small parameter values without forcing
them to be exactly zero. It smooths out parameter values.
● It's particularly effective in preventing the model from relying too
heavily on a small number of features.

Comparison:

● Both L1 and L2 regularization add penalties to the loss function based
on parameter magnitudes to prevent overfitting.
● L1 regularization can perform feature selection as it drives some
parameters to exactly zero, leading to models with fewer active
features.
● L2 regularization retains all features but discourages large parameter
values, which helps in generalization.
● The L1 penalty grows linearly with a weight's magnitude, so it shrinks
all weights at the same rate, while the L2 penalty grows quadratically,
so it penalizes large weights much more strongly than small ones.
