Unit 1 Question and Answers
Input Layer:
● The input layer serves as the initial point of the network and
represents the input data features.
● Each neuron in the input layer corresponds to a specific feature
of the input data. For instance, in image analysis, each pixel's
intensity could be a feature.
● The number of neurons in the input layer is determined by the
dimensionality of the input data.
Hidden Layers:
● Hidden layers are intermediary layers between the input and
output layers. They allow the network to capture complex
relationships within the data.
● Neurons in the hidden layers take the outputs of the previous
layer and compute a weighted sum of them.
● Each neuron's output is then passed through an activation
function. Common activation functions include the Rectified
Linear Unit (ReLU) and the hyperbolic tangent (tanh) function.
● The number of hidden layers and the number of neurons in each
layer are hyperparameters that can be adjusted based on the
complexity of the problem and the available data.
Output Layer:
● The output layer produces the final results of the network's
computations.
● The number of neurons in the output layer depends on the
specific task. For example, in binary classification, a single
neuron with a sigmoid activation is typical, although two neurons
(one per class) with a softmax can also be used. In multi-class
classification, there would be a neuron for each class.
● The activation function used in the output layer depends on the
nature of the task. For binary classification, a sigmoid function is
commonly used. For multi-class classification, a softmax
function is often employed.
Input Layer:
● The input layer is the initial part of the neural network where raw
data is fed into the system.
● Each neuron in the input layer corresponds to a feature or
attribute of the input data. For example, in an image
classification task, each pixel's intensity could be a feature.
● The number of neurons in the input layer is determined by the
dimensionality of the input data. If you have N features, you'll
have N neurons in the input layer.
Hidden Layers:
● Hidden layers are situated between the input and output layers.
They perform the computational work of transforming the input
data into meaningful representations.
● Neurons within hidden layers compute a weighted sum of their
inputs from the previous layer and then pass the result through
an activation function.
● Each hidden layer captures progressively abstract features as
you move deeper into the network. The number of hidden layers
and neurons per layer are design choices that impact the
network's capacity to learn complex patterns.
Weights and Biases:
● Weights represent the strengths of connections between
neurons in different layers. Each connection between neurons
has an associated weight.
● Biases are added to the weighted sum before passing through
the activation function. They allow the network to shift the
output values.
● Both weights and biases are learned during the training process,
aiming to optimize the network's performance on the given task.
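The weighted sum and bias described above can be sketched for a single neuron as follows. This is only an illustration: the names neuron_output, weights, and bias, the example values, and the choice of ReLU as the activation are assumptions, not something fixed by these notes.

```python
def relu(z):
    """ReLU activation: zero for negative inputs, the input itself otherwise."""
    return max(0.0, z)


def neuron_output(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical example: two inputs, two connection weights, one bias.
# z = 0.5 * 1.0 + (-0.25) * 2.0 + 0.5 = 0.5
print(neuron_output([1.0, 2.0], [0.5, -0.25], bias=0.5))
```

During training, the values in weights and bias are the quantities that get adjusted.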
Activation Functions:
● Activation functions introduce non-linearity to the neural
network. Without them, the network would behave like a linear
model, unable to capture complex relationships in the data.
● Common activation functions include:
● ReLU (Rectified Linear Unit): Applies zero for negative
inputs and the input itself for positive inputs.
● Sigmoid: Squeezes inputs into a range between 0 and 1.
● Tanh (Hyperbolic Tangent): Squeezes inputs into a range
between -1 and 1.
● Softmax: Used in the output layer of multi-class
classification to convert raw scores into probabilities.
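As a small sketch of how softmax converts raw scores into probabilities, the function below uses only the Python standard library. Subtracting the maximum score before exponentiating is a common numerical-stability trick and is an addition here, not something stated in these notes; the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score receives the largest probability
print(sum(probs))  # approximately 1.0
```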
Output Layer:
● The output layer produces the final results of the neural
network's computation.
● The number of neurons in the output layer depends on the task.
For example, for binary classification, a single neuron with a
sigmoid is typical (two neurons, one per class, with a softmax
are also possible). For multi-class classification, there would
be a neuron for each class.
● The activation function in the output layer varies depending on
the task. For instance, in binary classification, a sigmoid
function is typically used. In multi-class classification, a softmax
function is commonly applied.
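Putting the layers together, a forward pass through a tiny network might look like the sketch below. It assumes one hidden layer with ReLU and a single sigmoid output neuron (the binary classification case); the helper names dense and forward and all the weight values are hypothetical.

```python
import math


def dense(inputs, weights, biases):
    """One layer: each row of weights produces one neuron's weighted sum plus bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer (ReLU) -> one output neuron (sigmoid)."""
    hidden = [max(0.0, z) for z in dense(x, hidden_w, hidden_b)]
    (z_out,) = dense(hidden, out_w, out_b)  # expects exactly one output neuron
    return 1.0 / (1.0 + math.exp(-z_out))


# Hypothetical network: 2 input features, 2 hidden neurons, 1 output neuron.
x = [1.0, -1.0]
hidden_w = [[1.0, 0.0], [0.0, 1.0]]
hidden_b = [0.0, 0.0]
out_w = [[2.0, 3.0]]
out_b = [-2.0]
print(forward(x, hidden_w, hidden_b, out_w, out_b))  # a value in (0, 1)
```

The number of input features fixes the length of x, while the sizes of hidden_w and out_w reflect the design choices about hidden and output neurons.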
In the labeled diagram provided earlier, each component of the neural network
is visually represented:
If the weighted sum is equal to zero, the bias makes the output non-zero, or
more generally shifts the system's response. The bias can be treated as an
extra input that is always equal to 1 and has its own weight. The total of
weighted inputs is unbounded in general (with non-negative inputs and weights
it ranges from 0 to positive infinity). To keep the response within the limits
of the desired value, a certain maximum value is benchmarked, and the total of
weighted inputs is passed through the activation function.
The activation function refers to the set of transfer functions used to achieve
the desired output. There are different kinds of activation functions, but
they are primarily either linear or non-linear.
➢ Explain the following terms denoting their notations and equations
(where necessary) with respect to deep neural networks: (Any 5)
○ Connection weights and Biases
○ Epoch
○ Layers and Parameters
○ Activation Functions
○ Loss/Cost Functions
○ Learning rate
○ Sample and batch
○ Hyperparameters
1. Connection weights and Biases
● The weights and biases are updated during the training process using a
gradient descent algorithm. The goal of the gradient descent algorithm is to
find the weights and biases that minimize the error between the network's
predictions and the desired outputs.
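In its most common form, the gradient descent update can be written as below. The symbols are assumptions chosen to match the rest of these notes: w is a connection weight, b a bias, L the loss, and α the learning rate.

```latex
w \leftarrow w - \alpha \frac{\partial L}{\partial w},
\qquad
b \leftarrow b - \alpha \frac{\partial L}{\partial b}
```

Each update moves a parameter a small step in the direction that reduces the loss.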
Connection weights and biases are essential for the learning process in deep
neural networks. They allow the network to learn the relationships between its
inputs and outputs. By adjusting the weights and biases, the network can
learn to make more accurate predictions.
Here are some additional things to keep in mind about connection weights
and biases:
● The weights and biases are initialized randomly at the beginning of the
training process.
● The weights and biases are updated during the training process using
a gradient descent algorithm.
● The weights and biases are typically represented as matrices.
● The number of weights and biases in a neural network can be very
large.
● The weights and biases are important for the performance of the neural
network.
2. Epoch
5. Loss/Cost Functions
Loss functions measure the discrepancy between the predicted values of the
model and the actual ground truth values. The goal during training is to
minimize the value of the loss function, which essentially means making the
network's predictions as close to the actual values as possible.
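Two widely used loss functions, mean squared error and binary cross-entropy, can be sketched as follows. These particular losses are chosen here as examples and are not named in the notes above; the function names and the eps clamp are assumptions.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # (0 + 4) / 2 = 2.0
print(binary_cross_entropy([1, 0], [0.9, 0.1]))    # small: predictions are close
print(binary_cross_entropy([1, 0], [0.1, 0.9]))    # large: predictions are wrong
```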
Notation:
6. Learning rate
The learning rate is a hyperparameter that determines the step size at which
the model's parameters (weights and biases) are updated during the
optimization process, such as gradient descent. It's a crucial parameter as it
controls the rate of convergence during training and affects how quickly the
model learns from the data.
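The effect of the step size can be seen on a toy problem. The sketch below minimizes f(w) = w^2 (gradient 2w) with plain gradient descent; the function, the starting point, and the three learning rates are illustrative assumptions.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, starting from w."""
    for _ in range(steps):
        grad = 2.0 * w
        w = w - learning_rate * grad
    return w


print(abs(gradient_descent(0.1)))    # shrinks towards the minimum at 0
print(abs(gradient_descent(0.001)))  # barely moves: slow convergence
print(abs(gradient_descent(1.1)))    # grows: the updates overshoot and diverge
```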
Notation:
7. Sample and batch
Samples:
● A sample represents a single input data point (x) and its corresponding
target/output data (y).
● For example, in image classification, a sample could be an image and
its corresponding label.
● In mathematical notation, a single sample can be represented as (x_i,
y_i) where i is the index of the sample.
Batches:
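A batch is a group of samples processed together before the parameters are updated. As a minimal sketch, the helper below (make_batches is a hypothetical name) splits a list of (x_i, y_i) samples into consecutive batches of a given size; the dataset values are made up for illustration.

```python
def make_batches(samples, batch_size):
    """Split a list of (x_i, y_i) samples into consecutive batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [samples[i:i + batch_size]
            for i in range(0, len(samples), batch_size)]


# Hypothetical dataset of five (x_i, y_i) samples.
dataset = [(0.1, 0), (0.4, 0), (0.5, 1), (0.8, 1), (0.9, 1)]
for batch in make_batches(dataset, batch_size=2):
    print(batch)  # two samples per batch, with one left over in the last batch
```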
8. Hyperparameters
Hyperparameters are settings and values that are set before training a deep
neural network. These parameters are not learned during the training process;
instead, they are chosen by the developer or researcher and significantly
influence the behavior, capacity, and generalization of the neural network.
Notation:
● Learning Rate (α): The step size at which the model's parameters are
updated during optimization. Too high a learning rate can lead to
divergence, while too low a learning rate can lead to slow convergence.
● Batch Size (n): The number of samples in each training batch. A larger
batch size can lead to smoother parameter updates and more efficient
computation but might require more memory.
● Number of Hidden Units/Layers: The depth and width of the network.
More hidden units and layers can increase the network's capacity to
learn complex patterns, but may also lead to overfitting.
● Activation Functions: The non-linear functions applied to neuron
outputs. Different activation functions affect how information flows
through the network.
● Regularization Strength: The coefficients of L1 and L2 regularization
control how strongly overfitting is discouraged, by adding penalties to
the loss function based on the magnitudes of the parameters.
● Dropout Rate: The probability of dropping neurons during training to
reduce overfitting. It introduces randomness by setting the outputs of
some randomly chosen neurons to zero during each iteration.
Working of Backpropagation:
Backpropagation Algorithm:
Types of Backpropagation
There are two types of backpropagation networks.
● Static backpropagation: A network designed to map static inputs to
static outputs. These networks can solve static classification problems
such as OCR (Optical Character Recognition).
● Recurrent backpropagation: A network used for fixed-point learning.
Activations in recurrent backpropagation are fed forward until they
settle at a fixed value. Static backpropagation provides an instant
mapping, while recurrent backpropagation does not.
(Diagram)
Activation functions determine whether a neuron should be activated (fire) or
not based on the weighted sum of its inputs. They introduce non-linearity to
the network, enabling it to model and learn complex patterns in the data.
➢ Sigmoid Function:
○ Formula: σ(x) = 1 / (1 + exp(-x))
○ Range: (0, 1)
○ S-shaped curve, used in the past for binary classification. Can
suffer from the vanishing gradient problem for inputs of very
large magnitude, where the curve saturates.
➢ Hyperbolic Tangent (Tanh) Function:
○ Formula: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
○ Range: (-1, 1)
○ Similar to the sigmoid but zero-centered, which tends to make
optimization easier. It still saturates for inputs of large
magnitude, so vanishing gradients remain possible.
➢ Rectified Linear Unit (ReLU):
○ Formula: ReLU(x) = max(0, x)
○ Range: [0, ∞)
○ Widely used due to its simplicity and effectiveness. Reduces the
vanishing gradient problem for positive inputs and helps with
training deeper networks.
➢ Leaky ReLU:
○ Formula: LeakyReLU(x) = x if x > 0, else LeakyReLU(x) = α * x
where α is a small positive constant.
○ Range: (-∞, ∞)
○ Addresses the "dying ReLU" problem by allowing small
gradients for negative inputs.
➢ Parametric ReLU (PReLU):
○ Similar to Leaky ReLU but α is learned during training instead of
being a fixed constant.
➢ Exponential Linear Unit (ELU):
○ Formula: ELU(x) = x if x > 0, else ELU(x) = α * (exp(x) - 1) where
α is a positive constant.
○ Smooth transition around zero, which helps mitigate vanishing
gradient and allows negative values without loss of information.
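The formulas above translate almost directly into code. The following sketch uses only the Python standard library; the default values of alpha (0.01 for Leaky ReLU, 1.0 for ELU) are common choices assumed here, not values given in these notes, and the sigmoid is split into two branches only to avoid overflow for very negative inputs.

```python
import math


def sigmoid(x):
    """1 / (1 + exp(-x)), written in two branches to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    """Hyperbolic tangent, range (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """max(0, x)."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    """x for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, f(-2.0), f(0.0), f(2.0))
```

PReLU has the same form as leaky_relu, except that alpha would be a learned parameter rather than a fixed argument.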
Leaky ReLU:
● Formula: LeakyReLU(x) = x if x > 0, else LeakyReLU(x) = α * x where α
is a small positive constant.
● Range: (-∞, ∞)
● Pros:
● Addresses the "dying ReLU" problem by allowing small
gradients for negative inputs.
● Keeps negative activations close to zero, although not exactly
zero as in ReLU.
● Cons:
● Introduces a new hyperparameter α which needs to be tuned.
● May not perform as well as other activations for some tasks.
Comparison:
Advantages:
● ReLU: Simple, computationally efficient, addresses vanishing
gradient for positive inputs.
● Leaky ReLU: Addresses dying ReLU while keeping negative
activations close to (though not exactly) zero.
● ELU: Addresses dying ReLU, provides smooth transition, better
generalization.
Disadvantages:
● ReLU: Can suffer from dying ReLU, since it outputs zero (with
zero gradient) for all negative inputs.
● Leaky ReLU: Introduces a hyperparameter, may not perform as
well for all tasks.
● ELU: Slightly more computationally expensive.
Dying Neurons:
● ReLU: Prone to dying neurons for negative inputs.
● Leaky ReLU: Less prone to dying neurons due to non-zero
gradient for negative inputs.
● ELU: Non-zero gradient for all inputs, effectively eliminates
dying neurons.
Smoothness:
● ReLU: Not smooth at zero.
● Leaky ReLU: Not smooth at zero, but continuous.
● ELU: Continuous everywhere, with a continuous derivative at
zero when α = 1.
Sparsity:
● ReLU: Promotes sparsity due to zero activation for negative
inputs.
● Leaky ReLU: Negative inputs give small non-zero outputs, so
activations are only approximately sparse.
● ELU: Less sparsity-promoting due to non-zero values for
negative inputs.
Categories of Hyperparameters:
10. What is the vanishing gradient problem? How can the problem be identified,
and what are the solutions to avoid it?
The vanishing gradient problem is a challenge that occurs during the training
of deep neural networks, particularly networks with many layers. It refers to
the situation where the gradients of the loss function with respect to the
network's parameters become very small as they are propagated backward
through the layers during the training process. This leads to slow
convergence, difficulty in updating earlier layers' parameters, and impedes the
learning process.
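A rough numerical illustration: the derivative of the sigmoid is at most 0.25, so a gradient passed backward through many sigmoid layers is multiplied by many small factors. The sketch below makes the simplifying assumptions that every weight equals 1 and every neuron sees the same pre-activation value; backprop_factor is a hypothetical helper name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25 (reached at x = 0)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def backprop_factor(num_layers, pre_activation=0.0):
    """Product of sigmoid derivatives along a chain of layers (weights assumed 1)."""
    factor = 1.0
    for _ in range(num_layers):
        factor *= sigmoid_derivative(pre_activation)
    return factor


for n in (1, 5, 10, 20):
    print(n, backprop_factor(n))  # at most 0.25**n, so it shrinks geometrically
```

Even in this best case for the sigmoid, the factor reaching the earliest layers becomes tiny once the network is more than a few layers deep.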
You can identify the vanishing gradient problem through the following
observations during training:
Activation Functions:
● Avoid using activation functions that squash the output into a
small range and saturate (e.g., sigmoid or tanh). Instead, use
non-saturating activation functions like ReLU, Leaky ReLU, or ELU
that allow gradients to flow more freely for positive inputs.
Batch Normalization:
● Batch normalization normalizes the activations in each layer,
helping in maintaining reasonable gradient magnitudes
throughout the network. It can significantly mitigate the
vanishing gradient problem.
Skip Connections and Residual Networks (ResNets):
● Skip connections, introduced in ResNets, allow information to
bypass one or more layers. This helps gradients flow more
directly and can alleviate the vanishing gradient problem.
Gradient Clipping:
● Gradient clipping involves capping the magnitude of the gradients
during training. It mainly targets the related exploding gradient
problem (gradients becoming too large) rather than gradients that
are too small, but it keeps updates stable and ensures smoother
learning.
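One common form of clipping rescales the whole gradient vector when its norm exceeds a threshold. The sketch below assumes that variant (clipping by norm); the function name clip_by_norm and the threshold max_norm are hypothetical.

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm or norm == 0.0:
        return list(gradients)  # already within the limit: returned unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]


print(clip_by_norm([3.0, 4.0], max_norm=1.0))  # norm 5 is rescaled to norm 1
print(clip_by_norm([0.3, 0.4], max_norm=1.0))  # norm 0.5 is left as it is
```

The direction of the gradient is preserved; only its length is capped.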
Weight Initialization:
● Use appropriate weight initialization techniques, such as He
initialization, which can help in balancing the scale of activations
and gradients.
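He initialization draws weights with a variance of 2 / fan_in, where fan_in is the number of inputs to the layer. The sketch below uses the normal-distribution variant with the standard library only; the function name he_init and the layer sizes are illustrative assumptions.

```python
import math
import random


def he_init(fan_in, fan_out, seed=None):
    """Weight matrix (fan_out rows, fan_in columns) drawn from N(0, 2 / fan_in)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


weights = he_init(fan_in=200, fan_out=50, seed=0)
flat = [w for row in weights for w in row]
variance = sum(w * w for w in flat) / len(flat)
print(variance)  # close to 2 / 200 = 0.01
```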
Use Shallow Networks:
● If you suspect the vanishing gradient problem is severe,
consider using shallower networks or architectures with fewer
layers to reduce the depth-related issues.
Learning Rate Scheduling:
● Adjusting the learning rate during training (for example,
gradually reducing it) can stabilize learning, although it does
not by itself increase the size of the gradients.
Rescale Inputs:
● Rescale input features to be closer to zero mean and unit
variance. This helps keep activations out of the saturated
regions of functions like sigmoid and tanh, where gradients
become very small.
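Rescaling a single feature to zero mean and unit variance can be sketched as below. The function name standardize is hypothetical, and returning zeros for a constant feature is a design choice made here to avoid dividing by zero.

```python
def standardize(values):
    """Rescale one feature to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    if std == 0.0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]


scaled = standardize([2.0, 4.0, 6.0, 8.0])
print(scaled)  # symmetric around 0
```

In practice the mean and standard deviation would be computed on the training data and reused for validation and test data.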
Use LSTM and GRU Architectures:
● If you're dealing with sequential data, Long Short-Term Memory
(LSTM) and Gated Recurrent Unit (GRU) architectures are
designed to mitigate vanishing gradient problems in recurrent
networks.
Attention Mechanisms:
● Attention mechanisms, especially in natural language
processing tasks, can help the network focus on relevant parts
of the input sequence, reducing the impact of vanishing
gradients.
11. What are optimization algorithms? Explain gradient descent. (Notes)
12. Explain the following optimization algorithms: (Any 2)
○ Gradient Descent
○ Stochastic Gradient Descent
○ Mini-Batch Gradient Descent
1. Gradient Descent:
Algorithm Steps:
Algorithm Steps:
Algorithm Steps:
1. L1 Regularization (Lasso):
● Adds a penalty term to the loss function proportional to the
absolute values of the model's parameters.
● Encourages sparsity in the parameter values, leading to some
parameters being exactly zero.
● Useful for feature selection, as it tends to set less important
features' weights to zero.
● Loss Function: L(θ) + λ * Σ|θ|, where L(θ) is the original loss, λ is
the regularization strength, and θ represents model parameters.
2. L2 Regularization (Ridge):
● Adds a penalty term to the loss function proportional to the
squared values of the model's parameters.
● Encourages small parameter values, but does not force them to
exactly zero.
● Helps in preventing overfitting and stabilizing the training
process.
● Loss Function: L(θ) + λ * Σθ^2, where L(θ) is the original loss, λ
is the regularization strength, and θ represents model
parameters.
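The two penalty terms above can be computed as in the following sketch. The function names, the kind argument, and the example parameter values are assumptions for illustration; lam stands for the regularization strength λ.

```python
def l1_penalty(params, lam):
    """lam * sum of absolute parameter values (Lasso penalty)."""
    return lam * sum(abs(p) for p in params)


def l2_penalty(params, lam):
    """lam * sum of squared parameter values (Ridge penalty)."""
    return lam * sum(p * p for p in params)


def regularized_loss(base_loss, params, lam, kind="l2"):
    """Original loss L(theta) plus the chosen penalty term."""
    if kind == "l1":
        return base_loss + l1_penalty(params, lam)
    if kind == "l2":
        return base_loss + l2_penalty(params, lam)
    raise ValueError("kind must be 'l1' or 'l2'")


theta = [1.0, -2.0, 0.0]
print(regularized_loss(1.0, theta, lam=0.5, kind="l1"))  # 1.0 + 0.5 * 3 = 2.5
print(regularized_loss(1.0, theta, lam=0.5, kind="l2"))  # 1.0 + 0.5 * 5 = 3.5
```

The L2 penalty grows faster for large parameters, while the L1 penalty applies the same pressure regardless of size, which is what pushes small weights to exactly zero.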
3. Dropout:
● A regularization technique specific to neural networks.
● During training, randomly "drops out" (sets to zero) a proportion
of neurons in a layer with a specified probability.
● Forces the network to learn robust features that do not depend
heavily on any single neuron.
● Effectively trains an ensemble of sub-networks that share weights,
improving generalization.
● Not applied during inference: all neurons are kept, and their
outputs (or outgoing weights) are scaled by the keep probability so
that the expected activations match those seen during training.
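A minimal sketch of dropout on a list of activations follows. It assumes the "inverted dropout" variant, which is common in practice: the kept activations are scaled up by 1 / keep_prob during training, so that no rescaling is needed at inference. The function name dropout and its parameters are hypothetical.

```python
import random


def dropout(activations, drop_prob, training=True, rng=random):
    """Inverted dropout: zero each activation with probability drop_prob
    during training and scale the survivors by 1 / (1 - drop_prob)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # inference: every neuron is kept as it is
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


layer_output = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(dropout(layer_output, drop_prob=0.5))                  # mix of 0.0 and 2.0
print(dropout(layer_output, drop_prob=0.5, training=False))  # unchanged
```

Because a different random subset of neurons is dropped at each call, each training iteration effectively uses a different sub-network.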
1. L1 Regularization (Lasso):
Benefits:
2. L2 Regularization (Ridge):
Benefits:
Challenges:
14. Compare:
a. Dropout and DropConnect
b. L1 and L2 regularization
1. Dropout:
2. DropConnect:
Comparison:
1. L1 Regularization (Lasso):
2. L2 Regularization (Ridge):
Comparison: