ANN Viva Prep
Write a Python program to plot a few of the activation functions commonly used in neural networks.
Explanation:
This Python program uses the numpy and matplotlib libraries to plot several common
activation functions used in neural networks. Here's a breakdown:
1. Import Libraries:
o numpy is used for numerical operations, especially for creating the input data (x) and performing the mathematical calculations.
o matplotlib.pyplot is used to draw and display the plots.
2. Define Activation Functions:
o ReLU, Sigmoid, Tanh, Leaky ReLU and ELU are each defined as a small Python function.
▪ ELU (Exponential Linear Unit): Like ReLU for x > 0, but uses an exponential function for x < 0.
o Each function takes an input x (or an array of inputs) and returns the corresponding activation value(s).
3. Generate Input Data:
o A range of x values is created with numpy so that every function can be evaluated and plotted over the same interval.
4. Plotting:
o plt.figure(figsize=(12, 8)) creates a figure to hold the plots, setting the size
for better viewing.
o plt.subplot(rows, cols, index) divides the figure into a grid and selects a
specific subplot to draw on.
▪ plt.title(), plt.xlabel(), and plt.ylabel() set the title and axis labels for
the subplot.
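Putting the steps above together, a minimal sketch of such a script is shown below. The five functions, the 2x3 subplot grid, and the alpha values used for Leaky ReLU and ELU are illustrative choices consistent with the description above, not the only possible ones.

import numpy as np
import matplotlib.pyplot as plt

# Input range over which every activation function is evaluated
x = np.linspace(-5, 5, 500)

# Common activation functions (the 0.01 and 1.0 factors are typical defaults)
activations = {
    "ReLU": np.maximum(0, x),
    "Sigmoid": 1 / (1 + np.exp(-x)),
    "Tanh": np.tanh(x),
    "Leaky ReLU": np.where(x > 0, x, 0.01 * x),
    "ELU": np.where(x > 0, x, 1.0 * (np.exp(x) - 1)),
}

plt.figure(figsize=(12, 8))
for i, (name, y) in enumerate(activations.items(), start=1):
    plt.subplot(2, 3, i)          # 2x3 grid, one subplot per function
    plt.plot(x, y)
    plt.title(name)
    plt.xlabel("x")
    plt.ylabel("f(x)")
    plt.grid(True)

plt.tight_layout()
plt.show()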
Real-World Significance:
Activation functions are fundamental to neural networks and deep learning. They
introduce non-linearity, which allows neural networks to learn complex patterns in data.
Without them, a neural network would simply be a linear regression model, no matter
how many layers it had. Here are some examples of their significance:
The choice of activation function can significantly impact the performance of a neural
network. Different activation functions have different properties that make them
suitable for different tasks and network architectures.
o A linear function has a constant slope (its graph is a straight line), while a
non-linear function's slope varies.
o ReLU, Sigmoid, Tanh, Leaky ReLU, and ELU are common ones.
o ReLU returns the input directly if it's positive, and zero otherwise. f(x) =
max(0, x)
o Disadvantages: Can suffer from the "dying ReLU" problem (neurons can
get stuck outputting zero).
o The sigmoid function squashes values between 0 and 1. It's often used in
the output layer for binary classification problems.
8. What is the tanh function? How does it compare to the sigmoid function?
o Tanh squashes values into the range (-1, 1) and is zero-centred, whereas sigmoid outputs values in (0, 1). Zero-centred outputs often make optimisation in hidden layers easier.
9. What is the "vanishing gradient" problem, and how can ReLU help alleviate it?
o With sigmoid or tanh, the gradient becomes very small for large positive or negative inputs, so gradients shrink as they are propagated back through many layers. ReLU has a constant gradient of 1 for positive inputs, which keeps gradients from vanishing for active neurons.
10. What is Leaky ReLU, and how does it address the dying ReLU problem?
o Leaky ReLU allows a small, non-zero output for negative inputs, which
helps prevent neurons from getting stuck in an inactive state.
12. How do you choose an activation function for a specific task?
o The choice depends on the nature of the problem, the network architecture, and empirical experimentation. ReLU and its variants are often a good starting point, especially for CNNs. Sigmoid is suitable for binary classification outputs.
13. Can you plot other activation functions?
o Yes, you can easily add more activation functions to the code (e.g., variations of ReLU).
14. What is the significance of the slope of the activation function?
o The slope (or derivative) is crucial during backpropagation, as it determines how much the weights are updated. A larger slope can lead to faster learning but also instability, while a very small slope can lead to slow learning.
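To make the point about slopes concrete, here is a small illustrative snippet comparing the derivative of sigmoid with that of ReLU; the function names and sample values are my own, not taken from the program above.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)            # at most 0.25, and close to 0 for large |x|

def relu_derivative(x):
    return (x > 0).astype(float)  # constant slope of 1 for positive inputs

x = np.array([-6.0, 0.0, 6.0])
print(sigmoid_derivative(x))      # tiny slopes at the extremes -> slow learning
print(relu_derivative(x))         # gradient stays 1 for positive inputs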
import networkx as nx
import matplotlib.pyplot as plt

# A minimal diagram: two input nodes feeding one output node
graph = nx.DiGraph()
graph.add_node('input_1', layer='input')
graph.add_node('input_2', layer='input')
graph.add_node('output_1', layer='output')
graph.add_edge('input_1', 'output_1')
graph.add_edge('input_2', 'output_1')
pos = {'input_1': (0, 0.5), 'input_2': (0, -0.5), 'output_1': (1, 0)}
nx.draw(graph, pos, with_labels=True, node_color='skyblue', node_size=1500)
plt.show()
Explanation:
• This uses NetworkX to show a simple neural network diagram with two input
nodes and one output.
• Formula (sigmoid): f(x) = 1 / (1 + e^(-x)); Range: (0, 1)
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

def mcculloch_pitts(inputs, weights, threshold):
    # Fire (output 1) when the weighted sum reaches the threshold, else 0
    weighted_sum = np.dot(inputs, weights)
    return 1 if weighted_sum >= threshold else 0

def get_user_inputs():
    inputs = []
    weights = []
    for i in range(2):
        inputs.append(int(input(f"Enter binary input {i + 1} (0 or 1): ")))
        weights.append(float(input(f"Enter weight {i + 1}: ")))
    threshold = float(input("Enter threshold: "))
    return inputs, weights, threshold

def visualize_network(inputs, weights, threshold, result):
    G = nx.DiGraph()
    G.add_node("Input A", pos=(0, 2))
    G.add_node("Input B", pos=(0, 0))
    G.add_node("Neuron", pos=(2, 1))
    G.add_node(f"Output: {result}", pos=(4, 1))
    G.add_edge("Input A", "Neuron")
    G.add_edge("Input B", "Neuron")
    G.add_edge("Neuron", f"Output: {result}")
    pos = nx.get_node_attributes(G, 'pos')
    plt.figure(figsize=(8, 4))
    nx.draw(G, pos, with_labels=True, node_size=1500, node_color='skyblue',
            font_size=10, font_weight='bold', arrows=True, arrowsize=20)
    edge_labels = {("Input A", "Neuron"): f'W={weights[0]}',
                   ("Input B", "Neuron"): f'W={weights[1]}'}
    nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
    plt.show()

# Main function
def main():
    inputs, weights, threshold = get_user_inputs()
    result = mcculloch_pitts(inputs, weights, threshold)
    print(f"\nInputs: {inputs}")
    print(f"Weights: {weights}")
    print(f"Threshold: {threshold}")
    print(f"\nOutput: {result}")
    visualize_network(inputs, weights, threshold, result)

if __name__ == "__main__":
    main()
I. Conceptual Understanding
• Q: What logic function does this code implement?
o A: This code implements the "AND NOT" logic function (also known as A AND (NOT B) or A ∧ ¬B). It outputs 1 only when input A is 1 AND input B is 0.
• Q: How does the mcculloch_pitts function work, step by step?
o A:
1. It takes three arguments: inputs (a list of two binary values), weights (a list of two numerical values), and threshold (a single numerical value).
2. It computes weighted_sum as the dot product of inputs and weights (np.dot), i.e. each input multiplied by its corresponding weight and the products summed.
3. It checks whether weighted_sum is greater than or equal to the threshold.
4. If it is, the neuron "fires" and the function returns 1; otherwise it returns 0.
• Q: How would you modify the code to implement a different logic function (e.g., OR, NAND)?
o A: Only the weights and threshold need to change; the structure of the neuron stays the same. For example, OR can be realised with weights [1, 1] and threshold 1, and NAND with weights [-1, -1] and threshold -1 (see the sketch below).
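A small sketch of such a modification is shown below. It redefines the same neuron rule as above; the weight/threshold assignments are one possible choice, not the only one.

import numpy as np
from itertools import product

def mcculloch_pitts(inputs, weights, threshold):
    # Same rule as above: fire when the weighted sum reaches the threshold
    return 1 if np.dot(inputs, weights) >= threshold else 0

# One possible weight/threshold assignment per gate
gates = {
    "OR":   ([1, 1], 1),
    "NAND": ([-1, -1], -1),
}

for name, (weights, threshold) in gates.items():
    print(f"\n{name} gate:")
    for a, b in product([0, 1], repeat=2):
        print(f"  inputs=({a}, {b}) -> output={mcculloch_pitts([a, b], weights, threshold)}")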
• Q: What libraries are used in this code and what are they used for?
o A: numpy (to compute the weighted sum with np.dot), networkx (to build the network diagram), and matplotlib (to draw and display the diagram).
• Q: What data types are used for inputs, weights, and threshold?
o A:
▪ Inputs: Integers (binary values, 0 or 1).
▪ Weights: Floats (to allow for more granular control over the influence of inputs, including negative values).
▪ Threshold: Float (for consistency with the weights and to allow for non-integer threshold values).
• Q: How are the weights applied to the inputs?
o A: The weights are applied to the inputs through multiplication. Each input value is multiplied by its corresponding weight, and these products are then summed to produce the weighted_sum.
• Q: Can the McCulloch-Pitts neuron learn? If not, how can we make it learn?
o A: No — in its basic form the weights and threshold are fixed by hand. To make it learn, you would need to introduce a learning algorithm that can adjust the weights and threshold based on training data. A simple example is the Perceptron learning rule, which iteratively updates the weights based on the difference between the predicted output and the desired output.
1. Foundational Significance
• Historical Precursor: Its primary significance lies in its historical role. It was one
of the first attempts to formalize how a neuron might work. It laid the groundwork
for the development of more complex artificial neural networks. Think of it as the
"Model T" of neural networks – not practical for everyday use now, but essential
for the evolution of modern cars.
2. Relevance to Modern AI
• Teaching Tool: It's extremely valuable for teaching the basic principles before
diving into the complexities of modern deep learning. It helps to build a strong
foundation.
• Feature Detection: In a very rudimentary sense, the weights can be seen as a way
to detect specific "features" in the input. A high weight means the neuron is very
sensitive to that particular input.
• Digital Circuits: The logic gate implementation is directly related to the way digital
circuits work in computers. McCulloch-Pitts neurons showed a theoretical link
between neural computation and computation in general.
In Summary
The real-life significance of this code is primarily educational and historical. It's a
stepping stone to understanding the far more powerful and complex neural networks that
drive much of modern AI. It's not about the direct applications of the "AND NOT" function
in this simple neuron, but about grasping the fundamental principles that make neural
computation possible.
Practical No.3
Write a Python program using a Perceptron Neural Network to recognise even and odd numbers. The given numbers (0 to 9) are in ASCII form.
• What is a Perceptron?
o A Perceptron is a single artificial neuron that computes a weighted sum of its inputs, adds a bias, and applies a step activation function to produce a binary output.
o Bias: A constant term that allows the Perceptron to shift the decision boundary.
o Significance: Weights and bias are learned during training and define the decision boundary that separates the classes.
o Unit step function. Because it's suitable for binary classification problems
like even/odd, producing a clear 0 or 1 output.
o No. Perceptrons are linear classifiers and can only learn linearly separable
patterns.
o self.weights += update * x_i and self.bias += update adjust the weights and bias in the direction that reduces the error (a code sketch of this update rule appears at the end of this section).
o A small learning rate can lead to slow convergence, while a large learning
rate can cause instability or prevent convergence.
o In this case, it's a direct check by printing the predictions for each digit. For
more complex scenarios, you'd use metrics like accuracy, precision, recall,
etc.
• How would you modify the code to handle a larger set of input characters
(e.g., letters)?
o You would need to expand the digits dictionary to include the ASCII
representations for those characters.
• Could you use a different activation function? What would be the impact?
o Other activation functions like sigmoid or ReLU are more common in multi-
layer networks. For this simple binary classification with a Perceptron, the
unit step function is suitable. Using sigmoid would require adjusting the
output interpretation (probabilities instead of hard 0/1).
• How would you improve the accuracy of your Perceptron if it's not performing well?
o Tune the learning rate and number of iterations, check that the chosen input representation makes the classes linearly separable, or move to a multi-layer network if it does not.
Real-World Significance:
3. Binary Classification
• The even/odd task is a binary classification problem: each digit is assigned to one of exactly two classes.
4. Supervised Learning
• The Perceptron learns from labeled data (the ASCII representations and their
corresponding even/odd labels). This is supervised learning.
However, this program provides a valuable starting point for understanding how neural
networks can learn to extract meaningful information from data and make predictions.
Building on these fundamentals, we can develop more powerful and versatile neural
network models for a wide range of real-world applications.
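A minimal sketch of the Perceptron described in this section is given below. The 7-bit binary encoding of each digit's ASCII code, the class labels, and the hyperparameters are illustrative assumptions; the update rule is the one discussed above.

import numpy as np

class Perceptron:
    def __init__(self, lr=0.1, n_iters=100):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = 0.0

    def activation(self, x):
        # Unit step function: 1 if x >= 0 else 0
        return np.where(x >= 0, 1, 0)

    def fit(self, X, y):
        self.weights = np.zeros(X.shape[1])
        for _ in range(self.n_iters):
            for x_i, target in zip(X, y):
                prediction = self.activation(np.dot(x_i, self.weights) + self.bias)
                update = self.lr * (target - prediction)
                self.weights += update * x_i   # the update rule discussed above
                self.bias += update

    def predict(self, X):
        return self.activation(np.dot(X, self.weights) + self.bias)

# Each digit 0-9 encoded as the 7-bit binary form of its ASCII code (48-57)
digits = {d: [int(b) for b in format(ord(str(d)), '07b')] for d in range(10)}
X = np.array(list(digits.values()))
y = np.array([d % 2 for d in digits])   # label 1 = odd, 0 = even

p = Perceptron(lr=0.1, n_iters=100)
p.fit(X, y)
for d in digits:
    print(d, "-> odd" if p.predict(np.array([digits[d]]))[0] else "-> even")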
Practical No.4
With a suitable example demonstrate the perceptron learning law with its decision
regions using python. Give the output in graphical form.
• Perceptron Class:
o The __init__ method initializes the learning rate (lr), the number of iterations
(n_iters), and the activation function (unit step function). It also initializes
weights and bias.
o The predict method predicts the output for new input data using the
learned weights and bias.
o The code creates a simple dataset for the AND logic gate with inputs X and
outputs y.
o It creates a Perceptron object with a learning rate of 0.1 and trains it on the
AND data.
o The code generates a scatter plot of the input data points, color-coded
according to their class labels.
o It calculates the decision boundary line using the learned weights and bias.
o The plot shows how the perceptron separates the two classes (0 and 1) in
the AND gate problem.
2. Graphical Output
The plot shows the decision boundary (a red line) that the perceptron has learned to
separate the two classes of the AND gate.
• The black dots represent the input points (0, 0), (0, 1), and (1, 0), which belong to
class 0.
• The yellow dot represents the input point (1, 1), which belongs to class 1.
• The red line is the decision boundary. Points on one side of the line are classified
as one class, and points on the other side are classified as the other class.
• In this case, you can see that the perceptron has successfully found a line that
separates the input (1,1) from the other three inputs.
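A minimal sketch of the code this section describes is shown below. The Perceptron class follows the fit/predict structure outlined above; the exact plotting details (colours, plotted range) are illustrative choices.

import numpy as np
import matplotlib.pyplot as plt

class Perceptron:
    def __init__(self, lr=0.1, n_iters=100):
        self.lr, self.n_iters = lr, n_iters
        self.weights, self.bias = None, 0.0

    def _step(self, x):
        return np.where(x >= 0, 1, 0)

    def fit(self, X, y):
        self.weights = np.zeros(X.shape[1])
        for _ in range(self.n_iters):
            for x_i, target in zip(X, y):
                update = self.lr * (target - self._step(np.dot(x_i, self.weights) + self.bias))
                self.weights += update * x_i
                self.bias += update

    def predict(self, X):
        return self._step(np.dot(X, self.weights) + self.bias)

# AND gate data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

p = Perceptron(lr=0.1)
p.fit(X, y)

# Decision boundary: w1*x1 + w2*x2 + b = 0  ->  x2 = -(w1*x1 + b) / w2
x_values = np.linspace(-0.5, 1.5, 50)
y_values = -(p.weights[0] * x_values + p.bias) / p.weights[1]

plt.scatter(X[:, 0], X[:, 1], c=y)   # input points coloured by class
plt.plot(x_values, y_values, 'r-')   # learned decision boundary
plt.title("Perceptron decision boundary for AND")
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()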
Here are some potential viva questions and answers based on the code and the AND gate
example:
• Q: What is a perceptron?
o A: A perceptron is a single-neuron linear classifier: it computes a weighted sum of its inputs plus a bias and applies a step function to decide the class.
• Q: What is the role of the bias term?
o A: The bias term allows the perceptron to shift the decision boundary, providing more flexibility in classification. It's like an intercept in a linear equation.
Code-Specific Questions
• Q: What is the learning rate in this code, and how does it affect training?
o A: The learning rate is 0.1. It controls the step size at which the weights and
bias are updated during each iteration of the training process. A smaller
learning rate might lead to slower convergence but could also prevent
overshooting the optimal solution.
• Q: What does the fit method do?
o A: The fit method trains the perceptron. It takes the input data X and the target labels y as arguments and updates the weights and bias iteratively until it learns to classify the training data correctly.
• Q: How is the decision boundary calculated and plotted?
o A: The decision boundary is the line (in 2D) where the weighted sum of the inputs plus the bias equals zero: w1*x1 + w2*x2 + bias = 0. The code rearranges this equation to solve for x2 (y_values in the plot) in terms of x1 (x_values in the plot), the weights, and the bias.
• Q: Can a single-layer perceptron learn the XOR gate? Why or why not?
o A: No, a single-layer perceptron cannot learn the XOR gate because the
XOR function is not linearly separable. You cannot draw a single straight
line to separate the inputs (0, 0) and (1, 1) from the inputs (0, 1) and (1, 0).
4. Real-World Significance
The perceptron, although a simple model, is a fundamental building block in the field of neural networks and machine learning.
In summary, while the perceptron itself has limited applications, it is a crucial concept in
the history of artificial intelligence and machine learning, laying the groundwork for the
development of more powerful and versatile neural network models that are used in a
wide variety of applications today.
Practical A7 and B1
B1: Write a Python program to show a Back Propagation Network for the XOR function with binary input and output.
• What are the basic components of a neural network? (Neurons, weights, biases,
activation functions)
• What are some common activation functions? (Sigmoid, ReLU, etc.) Why is the
Sigmoid function used here?
o The Sigmoid function is used in this code because the XOR problem is a
binary classification problem (output is either 0 or 1), and the sigmoid
function outputs values between 0 and 1, which can be interpreted as
probabilities.
• What are weights and biases? How do they affect the output of a neuron?
o Biases allow the neuron to shift its activation function, which helps it to
learn patterns that don't necessarily pass through the origin. They allow the
neuron to activate even when all inputs are zero.
o Weights and biases are the parameters that the neural network learns
during training to map inputs to the correct outputs.
o Input Layer: Receives the raw data that is fed into the neural network.
o Hidden Layer(s): Perform intermediate computations that transform the inputs into internal representations.
o Output Layer: Produces the final result or prediction of the neural network.
o Forward propagation is the process of feeding the input data through the
neural network to generate a prediction. The input data is multiplied by the
weights, added to the biases, and passed through activation functions at
each layer, from the input layer to the hidden layer(s) and finally to the
output layer.
• In the code, explain how the input data flows through the network to produce an output.
o A:
1. The input data X is multiplied by the input-to-hidden weights (weights_input_hidden).
2. The hidden-layer biases are added, giving hidden_layer_input.
3. The sigmoid activation is applied, producing hidden_layer_output.
4. hidden_layer_output is multiplied by the hidden-to-output weights (weights_hidden_output) and the output bias is added, giving output_layer_input.
5. Finally, the result is passed through the sigmoid activation function to produce the network's final output (output_layer_output).
o Weights and biases are the parameters of the neural network. In forward
propagation, the weights determine how much each input contributes to
the neuron's activation, and the biases allow the neuron to shift its
activation threshold.
o The output of the hidden layer is the result of applying the activation
function to the weighted sum of the inputs plus the bias. In the code, this
is hidden_layer_output.
o The output of the output layer is the network's final prediction, which is the
result of applying the activation function to the weighted sum of the hidden
layer outputs plus the bias. In the code, this is output_layer_output.
o The activation function is applied to the weighted sum of the inputs (plus
the bias) at each neuron in the hidden and output layers. For example, in
the code, the sigmoid() function is applied to hidden_layer_input and
output_layer_input.
III. Backpropagation
• What is the role of error or loss function in backpropagation? (Mean Squared Error)
o The error or loss function measures how well the neural network is
performing. It quantifies the difference between the network's predictions
and the actual target values. Backpropagation uses the gradient of this loss
function to update the network's parameters. In this code, the Mean
Squared Error is implicitly used.
o output_error = y - output_layer_output
• Explain the concept of gradient descent. How does the learning rate affect it?
o Gradient descent is an optimization algorithm that repeatedly adjusts the weights and biases in the direction of the negative gradient of the loss, taking small steps towards a minimum.
o The learning rate determines the size of these steps. A small learning rate leads to slow convergence but can help avoid overshooting the minimum. A large learning rate can lead to faster convergence but may also cause the algorithm to oscillate or diverge.
o The weights and biases are updated using the following formulas (derived from gradient descent):
o weights_hidden_output += hidden_layer_output.T.dot(output_delta) * learning_rate
o weights_input_hidden += X.T.dot(hidden_delta) * learning_rate
o The biases are updated analogously, using the summed deltas multiplied by the learning rate (a complete sketch appears at the end of this section).
• Explain the terms "output delta" and "hidden delta" in the code. What do they
represent?
o These deltas are used to calculate the gradients of the weights and biases.
o If the derivative of the activation function was zero, that would mean that
the neuron's output is not sensitive to changes in its input. In
backpropagation, this would cause the gradient to be zero, and the network
would not be able to learn. This is known as the "vanishing gradient"
problem.
• What is the XOR problem? Why can't a single-layer perceptron solve it?
o The XOR (exclusive OR) problem is a logical problem where the output is 1
if either, but not both, of the inputs is 1. A single-layer perceptron cannot
solve it because the XOR function is not linearly separable; its data points
cannot be separated by a single straight line.
o A multi-layer neural network, with one or more hidden layers, can solve the
XOR problem by learning a non-linear representation of the input data. The
hidden layer(s) transform the input into a higher-dimensional space where
the XOR function is linearly separable.
• In the code, how is the XOR problem represented? (Input and expected output)
o The XOR problem is represented by the input data X = [[0, 0], [0, 1], [1, 0], [1, 1]] and the expected output y = [[0], [1], [1], [0]].
o A fully commented sketch of the program is given at the end of this section, where each step is explained in the context of the overall program.
• What libraries are used in the code? (NumPy) Why is NumPy used?
o The code uses the NumPy library. NumPy is used for efficient numerical
computations, especially for handling arrays and matrices. It provides
functions for matrix operations, array creation, and mathematical
functions, which are essential for implementing neural networks.
• What are the dimensions of the weight matrices and bias vectors in the code?
o Assuming 2 inputs, 2 hidden neurons and 1 output neuron, weights_input_hidden is 2x2, the hidden bias is 1x2, weights_hidden_output is 2x1, and the output bias is 1x1.
• How are the weights and biases initialized in the code? Why are they initialized
randomly?
o They are initialized randomly to break symmetry and allow the network to
learn different features. If they were all initialized to the same value, all
neurons in a layer would learn the same thing, and the network would not
be able to learn complex patterns.
• What is the learning rate in the code? How does it affect the training process?
o It controls the step size taken to update the weights and biases during each
iteration of backpropagation. A smaller learning rate requires more
iterations but can lead to more accurate convergence. A larger learning rate
can converge faster but may overshoot the optimal solution or cause
instability.
o def sigmoid(x): return 1 / (1 + np.exp(-x))
o def sigmoid_derivative(x): return x * (1 - x)  # here x is already a sigmoid output, so this equals s * (1 - s)
▪ Calculates the error term (delta) for the output layer. It multiplies
the difference between the actual output and the predicted output
(output_error) by the derivative of the sigmoid function evaluated at
the output of the output layer.
o weights_hidden_output += hidden_layer_output.T.dot(output_delta) *
learning_rate
o This block of code prints the loss (mean squared error) every 1000 epochs.
It is used to monitor the training progress and check if the network is
learning.
o The final output of the code is the predicted output of the neural network
for the given input data X. It represents the network's approximation of the
XOR function after training. The code also prints the final weights and
biases.
• How can you modify the code to improve the accuracy of the neural network? (e.g.,
by changing the learning rate, number of hidden layers, number of neurons in the
hidden layer, or number of epochs)
• How would you modify this code to work with a different activation function, such
as ReLU?
o To use ReLU, you would replace the sigmoid function and its derivative with
the ReLU function and its derivative in the code. You'd need to define relu()
and relu_derivative() functions and then substitute them in the forward and
backward propagation steps.
• How would you modify this code to handle more than two input variables?
o To handle more than two input variables, you would change the input_size
variable to the number of input variables and adjust the dimensions of the
weights_input_hidden matrix accordingly. The rest of the code would
largely remain the same.
• How would you modify this code to classify data into more than two categories?
o To classify data into more than two categories, you would change the
output_size variable to the number of categories. You would also need to
use a different activation function in the output layer, such as the softmax
function, and a different loss function, such as categorical cross-entropy.
The output y would also need to be represented in a one-hot encoded
format.
o Training Data: The data used to train the neural network, i.e., to adjust its
weights and biases.
o Validation Data: The data used to monitor the network's performance
during training. It helps to tune hyperparameters and prevent overfitting.
o Testing Data: The data used to evaluate the final performance of the
trained neural network on unseen data. It provides an unbiased estimate
of how well the network will generalize to new examples.
• What is overfitting? How can you prevent it? (Regularization, dropout, early stopping)
o Overfitting occurs when the network memorizes the training data and generalizes poorly to unseen data.
o Prevention: regularization (L1/L2), dropout, early stopping, and using more training data.
• How can you evaluate the performance of a neural network? (Accuracy, precision,
recall, F1-score, etc.)
• What are some techniques for optimizing the training process? (Different optimizers like Adam, RMSprop)
o Optimization techniques: adaptive optimizers such as Adam and RMSprop, momentum, learning-rate scheduling, and batch normalization.
• What are some variations of the basic neural network? (e.g., Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs))
o CNNs are mainly used for image-related tasks:
▪ Object detection
▪ Image segmentation
▪ Facial recognition
o RNNs are suited to sequential data:
▪ Speech recognition
▪ Machine translation
▪ Sentiment analysis
o Neural networks in general are applied to:
▪ Machine translation
▪ Medical diagnosis
▪ Financial modeling
▪ Autonomous driving
▪ Recommender systems
Real-World Significance
1. Neural Networks:
o Broader Applications:
▪ Financial Modeling: They are used for tasks like fraud detection,
risk assessment, and algorithmic trading.
2. XOR Problem:
In essence, the XOR problem is a foundational example that underpins the development
of neural networks capable of addressing the complexities of the real world.
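Pulling together the pieces referenced throughout this section (sigmoid and its derivative, the deltas, and the weight-update formulas), here is a minimal end-to-end sketch. The hidden-layer size, learning rate, number of epochs and random initialisation are illustrative assumptions, and convergence may require some tuning.

import numpy as np

np.random.seed(42)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(s):
    # s is the sigmoid output, so the derivative is s * (1 - s)
    return s * (1 - s)

# XOR inputs and expected outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

input_size, hidden_size, output_size = 2, 2, 1
learning_rate = 0.1
epochs = 10000

# Random initialisation breaks the symmetry between neurons
weights_input_hidden = np.random.uniform(size=(input_size, hidden_size))
weights_hidden_output = np.random.uniform(size=(hidden_size, output_size))
bias_hidden = np.random.uniform(size=(1, hidden_size))
bias_output = np.random.uniform(size=(1, output_size))

for epoch in range(epochs):
    # Forward propagation
    hidden_layer_input = np.dot(X, weights_input_hidden) + bias_hidden
    hidden_layer_output = sigmoid(hidden_layer_input)
    output_layer_input = np.dot(hidden_layer_output, weights_hidden_output) + bias_output
    output_layer_output = sigmoid(output_layer_input)

    # Backpropagation
    output_error = y - output_layer_output
    output_delta = output_error * sigmoid_derivative(output_layer_output)
    hidden_error = output_delta.dot(weights_hidden_output.T)
    hidden_delta = hidden_error * sigmoid_derivative(hidden_layer_output)

    # Gradient-descent updates
    weights_hidden_output += hidden_layer_output.T.dot(output_delta) * learning_rate
    bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate
    weights_input_hidden += X.T.dot(hidden_delta) * learning_rate
    bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate

    if epoch % 1000 == 0:
        loss = np.mean(np.square(output_error))   # mean squared error
        print(f"Epoch {epoch}, loss {loss:.4f}")

print("Predictions after training:")
print(np.round(output_layer_output, 3))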
Practical No.6 B3
B3: Write a Python program for creating a Back Propagation Feed-forward neural network.
• Explain the difference between the input layer, hidden layer, and output layer.
o The input layer receives the initial data. Hidden layers perform
intermediate computations. The output layer produces the final result.
• What are weights and biases in a neural network? How are they initialized?
o Weights scale each input to a neuron and the bias shifts its activation; they are typically initialized with small random values (random initialization breaks the symmetry between neurons).
• Why is the sigmoid activation function used here?
o The sigmoid function is used to squash the values between 0 and 1, making it suitable for binary classification problems.
Forward Propagation
o The hidden layer activation is calculated by taking the dot product of the inputs and the hidden weights, adding the hidden bias, and then applying the sigmoid activation function.
Backpropagation
o Calculate the error between the predicted output and the expected output.
o Calculate the derivative of the error with respect to the output layer's
activations.
o Calculate the derivative of the error with respect to the hidden layer's
activations.
o The error is calculated as the difference between the expected output and
the predicted output.
• How are the weights and biases updated? What is the learning rate?
o Weights and biases are updated by subtracting the product of the learning
rate and the error gradient. The learning rate (lr) controls the step size of the
updates.
Code Understanding
o hidden_weights: The weights connecting the input layer to the hidden layer.
o epochs: The number of times the entire training dataset is passed through
the neural network during training.
o lr: The learning rate, which controls the step size for updating weights and
biases.
• Explain the significance of the matrix operations used in the code (e.g., np.dot, .T).
o np.dot computes the weighted sums for an entire layer (and an entire batch) in a single matrix multiplication; .T transposes a matrix so that, during backpropagation, the deltas line up with the correct dimension of the inputs or weights. A small shape-checking sketch is given at the end of this section.
o The output of the code represents the final weights and biases after training, as well as the predicted output of the neural network for the given input.
o As the number of epochs increases, the neural network refines its weights
and biases, and the predicted output gets closer to the expected output.
• How would you modify the code to change the number of hidden layers or
neurons?
o To change the number of hidden layers, you would need to add more weight
matrices, bias vectors, and activation calculations. To change the number
of neurons in a layer, you would adjust the dimensions of the weight
matrices and bias vectors accordingly.
o It's a very basic network with a single hidden layer, limiting its ability to
learn complex patterns. It might also be prone to overfitting with more
complex datasets.
o Neural networks can be used for various tasks like image and speech
recognition, natural language processing, prediction, and classification.
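As referenced above, a small sketch showing how np.dot and .T keep the matrix shapes aligned during the forward and backward passes; the layer sizes here are arbitrary examples, not taken from the program itself.

import numpy as np

X = np.random.rand(4, 2)                # 4 samples, 2 input features
hidden_weights = np.random.rand(2, 3)   # 2 inputs -> 3 hidden neurons
hidden_bias = np.random.rand(1, 3)

hidden_out = 1 / (1 + np.exp(-(np.dot(X, hidden_weights) + hidden_bias)))
print(hidden_out.shape)                 # (4, 3): one hidden vector per sample

# During backpropagation the transpose aligns the shapes of the gradient:
hidden_delta = np.random.rand(4, 3)     # stand-in for the real delta
grad_hidden_weights = np.dot(X.T, hidden_delta)
print(grad_hidden_weights.shape)        # (2, 3): same shape as hidden_weights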
1. Fundamental Concepts
• Q: What is a Hopfield Network, and how does it differ from other neural networks?
o A: A Hopfield Network is a recurrent, fully connected network that acts as a content-addressable (associative) memory: rather than mapping inputs to outputs in a single feed-forward pass, it iteratively updates its neuron states until it settles into a stored pattern. Its main components are:
▪ Neurons: These are the basic processing units of the network. They
have a state, which is typically binary (e.g., +1 or -1, or 1 or 0).
▪ Update Rule: This rule specifies how the neurons update their
states. Updates can be synchronous (all neurons update
simultaneously) or asynchronous (neurons update one at a time).
• Q: What is the significance of the weight matrix (W) in a Hopfield Network, and
how is it constructed?
o A: The weight matrix (W) is the core of a Hopfield Network as it stores the
network's memory. The values of the weights determine which patterns the
network will recognize and retrieve. The weight matrix is typically
constructed using Hebbian learning.
o Hebbian Learning: In its basic form, Hebbian learning states that if two
neurons are active at the same time, the connection between them should
be strengthened. Conversely, if they are active at different times, the
connection should be weakened.
• Q: Why are the diagonal elements of the weight matrix typically set to zero in a Hopfield Network?
o A: Zeroing the diagonal removes self-connections. A neuron should not reinforce its own current state; self-feedback can destabilize the dynamics and lets each neuron trivially keep its value instead of being driven by the stored patterns.
2. Code-Specific Questions
• Q: Explain the purpose of the code snippet that calculates w1, w2, w3, and
w4. Walk through the calculations.
o Step-by-step breakdown:
• Q: What is the purpose of the activate function in the code, and what type of activation function is it?
o A: It is a threshold (step-like) activation function that decides each neuron's next state:
▪ If the input x is greater than the threshold theta (defaulting to 0), the function returns 1.
▪ If the input x is equal to the threshold theta, the function returns the original value x (i.e., the state is left unchanged).
▪ If the input x is less than the threshold theta, the function returns 0.
• Q: What does np.dot(x1, W_rev) calculate, and what is the meaning of the
result?
o Meaning of the result: The resulting vector represents the weighted sum
of the inputs received by each neuron in the network when presented with
the pattern x1. Each element in the resulting vector corresponds to the
input to a neuron, which is then passed through the activation function to
determine the neuron's next state. This calculation is a crucial step in the
iterative process of the Hopfield Network as it evolves towards a stable
state.
• Q: Explain the final if statement in the code. What condition is being checked, and what are the possible outcomes?
o A: It checks whether the state produced by the update step matches the original input pattern.
o Possible outcomes: either the pattern is reported as successfully recalled (a stable state equal to the input), or the network has settled into a different state, meaning the recall was not perfect.
• Q: In the provided code, the network does not perfectly recall the input pattern. What are the possible reasons for this?
o A: There are several potential reasons: the stored patterns may be too similar (causing interference), the network may have settled into a spurious stable state, the diagonal of the weight matrix may not have been zeroed, or too few update steps may have been performed.
• Q: How could you improve the pattern recall performance of the Hopfield Network implemented in the code?
o A: Zero the diagonal of W, store fewer (or more dissimilar) patterns, repeat the updates until the state stops changing, or increase the number of neurons relative to the number of stored patterns. A small end-to-end sketch follows below.
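The following is a minimal sketch of a Hopfield network along the lines discussed above (Hebbian weight construction, zero diagonal, threshold activation, recall with np.dot). The stored pattern, the bipolar encoding and the noisy probe are illustrative choices, not taken from the original code.

import numpy as np

def train_hopfield(patterns):
    # Hebbian learning: W is the sum of outer products of the stored patterns
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)     # no self-connections
    return W

def activate(x):
    # Bipolar threshold: +1 if input >= 0, else -1
    return np.where(x >= 0, 1, -1)

def recall(W, state, steps=5):
    for _ in range(steps):
        state = activate(np.dot(state, W))   # synchronous update
    return state

# Store one bipolar pattern and probe the network with a corrupted copy
stored = np.array([[1, -1, 1, -1]])
W = train_hopfield(stored)

probe = np.array([1, 1, 1, -1])              # one bit flipped
result = recall(W, probe)

print("Recalled:", result)
print("Matches stored pattern:", np.array_equal(result, stored[0]))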
8. Create a Neural network architecture from scratch in Python and use it to do multi-
class classification on any data. Parameters to be considered while creating the
neural network from scratch are specified as:
(4) Use more than 1 neuron in the output layer. Use a suitable threshold value. Use an appropriate optimisation algorithm.
• Q: Explain the difference between the input layer, hidden layer, and output layer.
o A: The input layer receives the raw features, hidden layers perform intermediate computations that learn internal representations, and the output layer produces the final prediction (here, one neuron per class).
• Q: What is a CNN?
o A: A Convolutional Neural Network is a neural network designed mainly for grid-like data such as images; it learns features through convolutional and pooling layers before classification.
• Q: What does the convolutional layer do?
o A: The convolutional layer uses filters (or kernels) that slide over the input image, performing element-wise multiplications and summations. This process extracts local features like edges, textures, and patterns.
o A: A filter (or kernel) is a small matrix of weights that is convolved with the
input data to extract features.
o A: The pooling layer reduces the spatial size of the feature maps,
decreasing the number of parameters and computations. Common
pooling operations are max pooling and average pooling.
o A: Max pooling selects the maximum value from each pooling window,
retaining the most salient features.
• Q: What is the shape of X_train and X_test? What does each dimension
represent?
o A: X_train has a shape of (50000, 32, 32, 3), meaning 50,000 training
images, each of size 32x32 pixels with 3 color channels (RGB). X_test has a
shape of (10000, 32, 32, 3) for the same reasons.
o A: It's a utility function to visualize a sample image from the dataset along
with its corresponding class label.
• Q: What is backpropagation?
o A: Backpropagation computes the gradient of the loss with respect to every weight and bias by applying the chain rule backwards through the network; an optimizer then uses those gradients to update the parameters.
• Q: What is the loss function? What loss function is suitable for this
classification problem?
o A: The loss function measures how well the model is performing. For multi-
class classification, "categorical cross-entropy" is commonly used.
• Q: What is accuracy?
o A: Accuracy is the fraction of samples for which the predicted class matches the true class.
• Q: What is overfitting, and how can it be prevented?
o A: Overfitting occurs when a model learns the training data too well and performs poorly on unseen data. Techniques to prevent it include:
▪ Data augmentation
▪ Dropout
▪ Regularization
▪ Early stopping
• Q: What is dropout?
o A: Dropout randomly deactivates a fraction of the neurons during each training step, which discourages co-adaptation between neurons and acts as a form of regularization.
• Q: Explain the architecture of the neural network you would create from scratch, given the constraints.
o A: For example: an input layer matching the number of features, one or two hidden layers with ReLU activations, and an output layer with one neuron per class using softmax, trained with categorical cross-entropy and an optimiser such as Adam (see the sketch at the end of this document).
• Q: Why is ReLU a good choice for the activation function in the hidden layers?
o A: ReLU is cheap to compute and does not saturate for positive inputs, which mitigates the vanishing-gradient problem and usually speeds up training.
• Q: Why is Softmax a suitable activation function for the output layer in this case?
o A: Softmax converts the output-layer scores into a probability distribution over the classes (values between 0 and 1 that sum to 1), so the predicted class is simply the one with the highest probability.
o If you were to use a sigmoid in the output layer for multi-label classification
(where an instance can belong to multiple classes), then a threshold would
be used to determine which classes are predicted as positive.
2. Implementation Details
• Q: How would you initialize the weights and biases in your neural network?
o A: Weights with small random values (to break the symmetry between neurons) and biases typically with zeros.
• Q: Describe the forward pass through this network.
1. The input features are presented to the input layer.
2. The weighted sum of the inputs and biases is calculated for each neuron in the hidden layer(s).
3. The ReLU activation is applied to those sums.
4. Steps 2-3 are repeated for any further hidden layers.
5. Finally, the weighted sum is calculated for the neurons in the output layer, and softmax turns it into class probabilities.
• Q: How would you calculate the error or loss in this multi-class classification
problem?
o A: A suitable loss function is "categorical cross-entropy." It measures the
difference between the predicted probability distribution and the true
distribution of class labels.
1. Calculate the gradient of the loss function with respect to the output layer's activations.
2. Propagate this gradient back through the network, calculating the gradient of the loss with respect to the weights and biases of each layer.
3. Use those gradients, together with the chosen optimisation algorithm, to update the weights and biases.
o Adam is often a good default choice due to its efficiency and effectiveness.
• Q: How would you update the weights and biases using the chosen
optimization algorithm (e.g., Adam)?
o A: The Adam optimizer adapts the learning rate for each parameter by
calculating an exponentially decaying average of past gradients and
squared gradients. The weights and biases are updated using these
calculated values and the learning rate.
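To make the description of Adam concrete, here is a sketch of one Adam update step for a single parameter array; the hyperparameter values are the commonly used defaults, and the variable names are illustrative rather than taken from any code in this document.

import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponentially decaying averages of past gradients and squared gradients
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the early steps
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update with a per-parameter adaptive step size
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Example usage with dummy values
w = np.array([0.5, -0.3])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([0.1, -0.2])
w, m, v = adam_update(w, grad, m, v, t=1)
print(w)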
3. Going Deeper
• Q: What is the role of the learning rate?
o A: The learning rate controls the step size taken to update the weights during training. A small learning rate can lead to slow convergence, while a large learning rate can cause the optimization process to diverge.
• Q: What are the main variants of gradient descent?
o A:
▪ Batch Gradient Descent: Calculates the gradient using the entire training dataset.
▪ Stochastic Gradient Descent: Updates the parameters using one training sample at a time.
▪ Mini-batch Gradient Descent: Updates the parameters using small batches, balancing the stability of batch GD with the speed of SGD.
• Q: How can overfitting be prevented in this network?
o A:
▪ Regularization (L1, L2): Add a penalty term to the loss function to discourage large weights.
▪ Dropout: Randomly deactivate a fraction of the neurons during training.
▪ Early Stopping: Monitor the model's performance on a validation set and stop training when it starts to degrade.
• Q: What are the challenges of training very deep neural networks, and how can they be addressed?
o A: Vanishing or exploding gradients, long training times, and overfitting. They can be addressed with:
▪ Batch normalization
▪ Residual (skip) connections
▪ Careful weight initialization and ReLU-family activations
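Finally, a compact from-scratch sketch matching the constraints of this practical: multiple output neurons, softmax with categorical cross-entropy, and plain gradient-descent optimisation (Adam could be substituted as discussed above). The tiny synthetic dataset, layer sizes and hyperparameters are illustrative assumptions.

import numpy as np

np.random.seed(0)

def relu(x):
    return np.maximum(0, x)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Tiny synthetic 3-class dataset: 2 features, 3 Gaussian blobs
n_per_class, n_classes = 50, 3
centers = np.array([[0, 0], [3, 3], [0, 3]])
X = np.vstack([c + np.random.randn(n_per_class, 2) for c in centers])
labels = np.repeat(np.arange(n_classes), n_per_class)
Y = np.eye(n_classes)[labels]                  # one-hot targets

# Network: 2 inputs -> 16 hidden (ReLU) -> 3 outputs (softmax)
W1 = np.random.randn(2, 16) * 0.1
b1 = np.zeros((1, 16))
W2 = np.random.randn(16, n_classes) * 0.1
b2 = np.zeros((1, n_classes))
lr, epochs = 0.1, 500

for epoch in range(epochs):
    # Forward pass
    h = relu(X @ W1 + b1)
    probs = softmax(h @ W2 + b2)

    # Categorical cross-entropy loss
    loss = -np.mean(np.sum(Y * np.log(probs + 1e-9), axis=1))

    # Backward pass (softmax + cross-entropy gradient simplifies to probs - Y)
    d_out = (probs - Y) / X.shape[0]
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0, keepdims=True)
    d_hidden = (d_out @ W2.T) * (h > 0)        # ReLU derivative
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0, keepdims=True)

    # Plain gradient-descent updates
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

    if epoch % 100 == 0:
        acc = np.mean(probs.argmax(axis=1) == labels)
        print(f"epoch {epoch}: loss {loss:.3f}, accuracy {acc:.2f}")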