Unit 3 Endsem PYQs
Answer:
Fundamental of Neural Networks and Artificial Neural Networks in Big Data (May-Jun
2023 Q1 a & May-Jun 2024 Q2 b)
Artificial Neural Networks (ANNs), often simply called Neural Networks (NNs), are computing
systems inspired by the structure and function of the human brain. They are designed to
recognize patterns, learn from data, and make predictions or decisions.
Fundamentals:
1. Neurons (Nodes): The basic building blocks of an ANN are artificial neurons, analogous
to biological neurons. Each neuron receives inputs, processes them, and produces an
output.
2. Connections (Synapses) and Weights: Neurons are interconnected, and each
connection has an associated 'weight'. These weights represent the strength or
importance of the connection. During training, the network learns by adjusting these
weights.
3. Activation Function: After summing the weighted inputs, a neuron applies an activation
function (also known as a transfer function). This function introduces non-linearity,
allowing the network to learn complex patterns. Common activation functions include
Sigmoid, ReLU, Tanh, etc.
4. Layers: Neurons are typically organized into layers:
Input Layer: Receives the raw input data.
Hidden Layers: One or more layers between the input and output layers, where the
majority of the computational processing and feature extraction occurs.
Output Layer: Produces the final output of the network, which could be a prediction,
classification, or another form of decision.
5. Learning (Training): ANNs learn from data through a process called training. They are
fed with input data and corresponding desired outputs (labels). The network adjusts its
internal weights and biases to minimize the difference (error or loss) between its
predicted output and the true output. This process typically involves optimization
algorithms like Gradient Descent and Backpropagation.
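As a minimal sketch of points 1 to 3 above, a single artificial neuron can be written in plain Python. The input values, weights, and bias below are made-up numbers chosen only for illustration, and Sigmoid is picked as one possible activation function.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(net)

# Hypothetical values, not taken from any real dataset.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

output = neuron_output(inputs, weights, bias)
print(round(output, 4))
```

Here the net input is 0.2 - 0.3 - 0.4 + 0.1 = -0.4, so the neuron outputs a value slightly below 0.5. Training would adjust `weights` and `bias` to move this output toward the desired label.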
Neural Networks are particularly well-suited for handling big data due to their ability to:
Extract Complex Patterns: Big data often contains intricate, non-linear relationships
that traditional statistical methods might miss. Deep Neural Networks (with many hidden
layers) can automatically learn hierarchical features from vast amounts of raw data.
Scalability: Modern NN architectures and training algorithms (e.g., mini-batch gradient
descent) combined with distributed computing frameworks (like Apache Spark,
TensorFlow Distributed) and specialized hardware (GPUs, TPUs) allow NNs to scale to
massive datasets.
Feature Learning: Instead of requiring manual feature engineering (which is challenging
with big data's high dimensionality and variety), NNs can learn relevant features directly
from the data, which is crucial when dealing with unstructured big data (images, text,
audio).
Handling Variety: NNs, especially specialized architectures like Convolutional Neural
Networks (CNNs) for images and Recurrent Neural Networks (RNNs) for text/sequences,
can process diverse types of big data.
Predictive Power: With enough data and computational resources, NNs can achieve
state-of-the-art accuracy in tasks like predictive analytics, anomaly detection, and
recommendation systems on big data platforms.
The architecture of an Artificial Neural Network defines how its neurons are organized into
layers and how these layers are connected. The most common and fundamental architecture
is the Feedforward Neural Network (FNN), particularly the Multi-Layer Perceptron (MLP).
1. Input Layer:
Consists of neurons that receive the initial data. Each neuron in the input layer
corresponds to a feature in the dataset.
No processing or activation functions are applied here; they simply pass the input
values to the next layer.
Example: For an image classification task, if the image is 28x28 pixels, the input
layer might have 784 neurons, each representing a pixel's intensity.
2. Hidden Layers:
One or more layers situated between the input and output layers. These layers are
where the network performs complex computations and learns intricate
representations of the input data.
Each neuron in a hidden layer receives inputs from all neurons in the preceding
layer, computes a weighted sum, adds a bias, and then applies an activation
function.
The number of hidden layers and neurons within them are hyperparameters that
need to be tuned. Deeper networks (with more hidden layers) are called "Deep
Neural Networks."
Example: In a sentiment analysis task, a hidden layer might learn to identify
combinations of words that signify positive or negative sentiment.
3. Output Layer:
Contains neurons that produce the final output of the network.
The number of neurons in the output layer depends on the type of problem:
Regression: Typically one neuron (e.g., predicting house price).
Binary Classification: One neuron (e.g., predicting spam/not spam) with a
sigmoid activation.
Multi-class Classification: One neuron per class (e.g., classifying digits 0-9
would have 10 output neurons) with a softmax activation.
The activation function used here depends on the problem type (e.g., Sigmoid for
binary, Softmax for multi-class classification, linear for regression).
4. Connections (Synapses):
Each neuron in one layer is typically connected to every neuron in the subsequent
layer (fully connected or dense layers).
Each connection has an associated numerical weight ($w_{ij}$), which represents the
strength of the connection from neuron i in the previous layer to neuron j in the
current layer.
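5. Biases ($b_j$):
Each neuron (except input neurons) typically has an associated bias term ($b_j$).
Common activation functions applied to the weighted sum include Sigmoid
($\frac{1}{1+e^{-x}}$) and Tanh ($\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$).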
O O O O
O --W--> O --W--> O --W--> O --W--> O
O O O O
O O O O
^ ^ ^ ^
| | | |
Raw Data Feature Extr. Complex Mapping Decision/Output
Flow of Information: In feedforward networks, information flows strictly from the input
layer, through hidden layers, to the output layer, without any loops or cycles.
Learning: The network learns by adjusting the weights (W) and biases (b) based on the
error between its predictions and the actual target values, typically using optimization
algorithms like Gradient Descent and Backpropagation.
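The layer-by-layer flow described above can be sketched as a small fully connected network in plain Python. This is only an illustration: the layer sizes, weights, and biases are hand-picked assumptions, `dense_layer` is a hypothetical helper name, and `weights[i][j]` follows the notes' convention of a weight from neuron i in the previous layer to neuron j in the current layer.

```python
import math

def relu(z):
    """ReLU activation: pass positive values, clamp negatives to zero."""
    return max(0.0, z)

def sigmoid(z):
    """Sigmoid activation, used here for a single binary output."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases, activation):
    """Compute one fully connected layer: weighted sum, plus bias, then activation."""
    outputs = []
    for j in range(len(biases)):
        net_j = sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
        outputs.append(activation(net_j))
    return outputs

# Input layer: two features, passed on unchanged.
x = [1.0, 2.0]

# Hidden layer: 2 inputs -> 3 neurons (hypothetical weights).
W_hidden = [[0.5, -1.0, 0.0],
            [0.25, 0.5, -0.5]]
b_hidden = [0.0, 0.0, 0.5]
hidden = dense_layer(x, W_hidden, b_hidden, relu)

# Output layer: 3 hidden neurons -> 1 neuron with sigmoid (binary classification).
W_out = [[1.0], [1.0], [1.0]]
b_out = [-1.0]
output = dense_layer(hidden, W_out, b_out, sigmoid)

print(hidden, output)
```

Information moves strictly forward: `x` feeds `hidden`, and `hidden` feeds `output`, with no value ever fed back to an earlier layer.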
Answer:
3. Weighted Sum: The weighted inputs are summed up, along with a bias term (b). This is
called the net input:
$net = \sum_{i=1}^{n} (x_i \cdot w_i) + b$
4. Activation (Step) Function: The net input is then passed through a step (or threshold)
activation function. For a simple perceptron, this is typically a Heaviside step function:
$f(net) = 1$ if $net \geq 0$, otherwise $0$
This function outputs either 1 or 0 (or +1 or -1 in the bipolar variant), classifying the
input into one of two classes.
Perceptron Diagram:
w1
x1 ----O----
w2 \
x2 ----O---- \ SUM + Bias ----> Activation Function ----> Output
... /
xn ----O---- /
wn
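The weighted sum and step activation in the diagram can be sketched in a few lines of Python. The weights and bias below are hand-picked (not learned) so that the perceptron behaves like a logical AND gate; they are an assumption made for illustration, and the threshold is taken to be zero.

```python
def step(net):
    """Heaviside step: 1 if the net input is non-negative, else 0."""
    return 1 if net >= 0 else 0

def perceptron(inputs, weights, bias):
    """Single perceptron: weighted sum plus bias, then a step activation."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(net)

# Hand-picked parameters that realise logical AND on binary inputs.
weights = [1.0, 1.0]
bias = -1.5

truth_table = {(a, b): perceptron([a, b], weights, bias)
               for a in (0, 1) for b in (0, 1)}
print(truth_table)
```

Only the input (1, 1) gives a net input above zero (1 + 1 - 1.5 = 0.5), so it is the only case classified as 1. AND is linearly separable, which is why a single perceptron is enough here.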
Types of Perceptron:
O------------------O------------------O
O------------------O------------------O
O------------------O------------------O
Difference between Linear and Nonlinear Neural Networks? (May-Jun 2023 Q2 b &
May-Jun 2024 Q1 b)
The core distinction between linear and non-linear neural networks lies in their activation
functions and, consequently, their ability to model complex relationships in data.
Feature | Linear Neural Network | Nonlinear Neural Network
Activation Function | Uses a linear activation function (e.g., f(x) = x) or no activation function at all. The output is a direct sum of weighted inputs. | Uses non-linear activation functions (e.g., Sigmoid, ReLU, Tanh, Softmax).
Output Type | The output is a linear combination of its inputs. | The output is a non-linear transformation of its inputs.
Decision Boundary | Can only learn linear decision boundaries (straight lines, planes, or hyperplanes). | Can learn complex, non-linear decision boundaries (curves, complex shapes).
Problem Solving | Suitable only for linearly separable problems. If the data cannot be separated by a straight line, it will fail. | Capable of solving non-linearly separable problems, which are common in real-world data.
Complexity | Simpler, less powerful. Even multiple layers of linear activations would still result in an overall linear model (a composition of linear functions is linear). | More complex, powerful, and capable of learning intricate patterns. Deeper networks (Deep Learning) rely heavily on non-linearity.
Examples | Single-Layer Perceptron (when using a simple threshold), Linear Regression models. | Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs).
Practical Use | Limited to very simple problems. Rarely used alone for complex tasks. | Foundation of most modern AI and machine learning applications due to their ability to model real-world complexities.
Linear Separation:
x
+ +
+ - - - - - - -
- -
-
(A single straight line can separate '+' from '-')
In essence, the non-linear activation functions are what give neural networks their power to
learn complex, non-trivial relationships in data. Without them, a multi-layer neural network
would simply be equivalent to a single-layer linear model, no matter how many layers it has.
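The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses two small hand-picked weight matrices (assumptions for illustration) and the column-vector convention W·x + b; it shows that two linear layers give the same output as a single layer with weights W2·W1 and bias W2·b1 + b2.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

# Two linear layers with no activation function (hypothetical parameters).
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [1.0, 0.0]
W2, b2 = [[2.0, 0.0], [1.0, 1.0]], [0.0, 3.0]

def two_linear_layers(x):
    hidden = add(matvec(W1, x), b1)
    return add(matvec(W2, hidden), b2)

# Equivalent single layer: W_eq = W2 W1, b_eq = W2 b1 + b2.
W_eq = matmul(W2, W1)
b_eq = add(matvec(W2, b1), b2)

def one_linear_layer(x):
    return add(matvec(W_eq, x), b_eq)

x = [3.0, -2.0]
print(two_linear_layers(x), one_linear_layer(x))
```

Inserting a non-linear activation such as ReLU between the two layers breaks this equivalence, which is what lets deeper networks represent more than a single linear map.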
May-Jun 2023 Q1 c) What is feed forward neural network explain with example. [8]
May-Jun 2024 Q1 a) What are the components of a feed forward neural network?
Explain with the help of neat sketch. [6]
Answer:
A Feedforward Neural Network (FNN) is the most basic type of artificial neural network
where the connections between nodes do not form a cycle. Information flows in only one
direction: from the input layer, through any hidden layers, and to the output layer. It's called
"feedforward" because the information is propagated forward through the network. The Multi-
Layer Perceptron (MLP) is a common example of an FNN.
1. Input Layer:
Purpose: Receives the raw input features of the dataset.
Structure: Consists of neurons (nodes), where each neuron corresponds to an input
feature.
Operation: Simply passes the input values to the first hidden layer. No computations
or activation functions are applied here.
2. Hidden Layers:
Purpose: Perform complex computations, extract features, and transform the input
data into more abstract representations. They are responsible for learning the
intricate patterns within the data.
Structure: One or more layers located between the input and output layers. Each
neuron in a hidden layer is connected to all neurons in the previous layer.
Operation: For each neuron j in a hidden layer:
It computes a weighted sum of inputs from the previous layer:
$net_j = \sum_i (w_{ij} \cdot x_i) + b_j$
It then applies an activation function f to $net_j$ to produce the neuron's output.
3. Output Layer:
Purpose: Produces the final result of the network, which can be a prediction (for
regression) or a classification (for classification tasks).
Structure: Contains neurons that correspond to the desired output. The number of
neurons depends on the task (e.g., 1 for binary classification/regression, N for N-
class classification).
Operation: Similar to hidden layers, it computes a weighted sum and applies an
activation function, which is chosen based on the problem type (e.g., Sigmoid for
binary classification, Softmax for multi-class classification, linear for regression).
4. Weights (W):
Purpose: Numerical values associated with each connection between neurons. They
represent the strength or importance of the connection.
Role: These are the parameters that the network learns during training. A higher
weight means the input from that connection has a stronger influence on the
receiving neuron.
5. Biases (b):
Purpose: An additional parameter associated with each neuron (except input
neurons). It allows the activation function to be shifted, providing more flexibility for
the model to fit the data.
Role: Acts like an intercept term in linear regression, allowing the neuron to activate
even if all inputs are zero, or to suppress activation if inputs are high but the bias is
very negative.
6. Activation Functions (f):
Purpose: Introduce non-linearity into the network, enabling it to learn and model
complex, non-linear relationships in data. Without non-linear activations, any multi-
layer FNN would behave like a single-layer linear model.
Placement: Applied after the weighted sum in hidden and output layers.
Explanation of Sketch:
Let's consider classifying handwritten digits (0-9) from grayscale images (e.g., 28x28 pixels).
1. Input Layer:
An image of 28x28 pixels can be flattened into a vector of 28 × 28 = 784 pixel values.
The input layer would have 784 neurons, each representing the intensity of one
pixel.
2. Hidden Layers:
We might have one or two hidden layers, say, the first with 128 neurons and the
second with 64 neurons.
Each neuron in the first hidden layer receives input from all 784 input neurons. It
calculates a weighted sum of these pixel values, adds its bias, and applies a ReLU
activation function.
The outputs of the first hidden layer serve as inputs to the second hidden layer, and
so on. These layers learn to extract increasingly complex features, like edges,
curves, and parts of digits.
3. Output Layer:
Since we are classifying 10 digits (0-9), the output layer would have 10 neurons.
Each output neuron corresponds to one digit class.
A Softmax activation function is typically applied to the output layer. Softmax converts
the raw outputs into probabilities, where the sum of probabilities for all 10 classes
equals 1.
The neuron with the highest probability indicates the predicted digit.
1. Forward Pass: An image (e.g., a '5') is fed into the input layer. The pixel values
propagate forward through the hidden layers, undergoing weighted sums and activation
functions, until a set of 10 probability scores is produced by the output layer (e.g., [0.1,
0.05, ..., 0.8, ..., 0.01] where 0.8 might be for class '5').
2. Loss Calculation: The network's predicted probabilities are compared to the actual label
(e.g., a "one-hot" encoded vector [0,0,0,0,0,1,0,0,0,0] for '5') using a loss function (e.g.,
Cross-Entropy Loss). The loss quantifies how "wrong" the prediction was.
3. Backward Pass (Backpropagation): The calculated loss is then propagated backward
through the network. This process determines how much each weight and bias
contributed to the error.
4. Weight Update (Gradient Descent): Using the calculated gradients, an optimizer (like
Gradient Descent) adjusts the weights and biases slightly to reduce the loss.
5. Iteration: Steps 1-4 are repeated over many images, and over multiple passes through the
training set (epochs), until the network learns to accurately classify digits.
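The output-layer Softmax and the loss calculation in step 2 can be sketched in plain Python. The ten raw scores below are invented for illustration (they are not the output of a trained network), and the target is the one-hot vector for the digit '5'.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    """Cross-entropy loss between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

# Hypothetical raw outputs of the 10 output neurons for one image.
raw_scores = [0.1, 0.2, 0.0, 0.3, 0.1, 2.5, 0.4, 0.0, 0.2, 0.1]
target = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]        # one-hot label for digit '5'

probs = softmax(raw_scores)
predicted_digit = probs.index(max(probs))
loss = cross_entropy(probs, target)

print(predicted_digit, round(loss, 4))
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the correct class, so it approaches zero as that probability approaches 1.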
May-Jun 2024 Q2 a) Explain Gradient descent algorithm that is used to train the
neural networks.[6]
May-Jun 2024 Q2 c) How the backpropagation algorithm works? Explain with
suitable example.[8]
Answer:
Explain Gradient Descent Algorithm that is used to train the Neural Networks. (May-
Jun 2024 Q2 a)
Gradient Descent is a widely used iterative optimization algorithm for training neural
networks. Its primary goal is to find the set of weights and biases for the network that
minimize a given loss function (or cost function). The loss function quantifies the error
between the network's predictions and the actual target values.
Core Concept:
Imagine the loss function as a landscape with hills and valleys, where the lowest point (a
valley) represents the minimum loss. Gradient Descent works by iteratively "descending" this
landscape in the steepest possible direction until it reaches a local (or global) minimum. The
"steepest direction" is given by the negative of the gradient of the loss function.
1. Initialization: Start by initializing the network's weights and biases randomly (or with
small values).
2. Calculate Loss: For a given set of input data, perform a forward pass through the
network to obtain predictions. Then, calculate the value of the loss function based on
these predictions and the actual target values.
3. Calculate Gradients: Compute the gradient of the loss function with respect to each
weight and bias in the network. The gradient indicates the direction of the steepest
ascent (maximum increase) of the loss function. We want to move in the opposite
direction.
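Mathematically, for a weight $w_j$ and a loss function $J(\theta)$, where $\theta$ represents all
the parameters (weights and biases), the gradient is the partial derivative
$\frac{\partial J(\theta)}{\partial w_j}$.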
4. Update Parameters: Adjust each weight and bias by moving a small step in the direction
opposite to its gradient. The size of this step is controlled by a parameter called the
learning rate (α).
The update rule for a parameter $\theta_j$ is:
$\theta_j^{new} = \theta_j^{old} - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_j}$
A small learning rate leads to slow but potentially more stable convergence. A large
learning rate can cause oscillations or divergence.
5. Iteration: Repeat steps 2-4 for a specified number of training iterations (epochs) or until
the change in the loss function becomes negligible (convergence).
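The steps above can be sketched on a deliberately simple one-parameter problem. The loss J(θ) = (θ - 3)^2 is an assumption chosen because its minimum (θ = 3) and gradient (2(θ - 3)) are known exactly; a neural network's loss is far more complex, but the update rule is the same.

```python
def loss(theta):
    """Toy loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def gradient(theta):
    """Derivative of the toy loss with respect to theta."""
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, learning_rate, steps):
    """Repeatedly step in the direction opposite to the gradient."""
    for _ in range(steps):
        theta = theta - learning_rate * gradient(theta)
    return theta

# A small learning rate converges steadily toward the minimum.
theta_small = gradient_descent(theta=0.0, learning_rate=0.1, steps=100)

# A learning rate that is too large overshoots more on every step and diverges.
theta_large = gradient_descent(theta=0.0, learning_rate=1.1, steps=20)

print(theta_small, loss(theta_small))
print(theta_large, loss(theta_large))
```

With a learning rate of 0.1, each step shrinks the distance to the minimum by a factor of 0.8. With 1.1, each step multiplies it by -1.2, so the iterate oscillates around the minimum with growing amplitude.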
Analogy:
Think of yourself blindfolded on a mountainous terrain, trying to reach the lowest point (the
valley). You can't see the whole landscape. What you can do is feel the slope under your
feet. To go downhill fastest, you'd take a step in the direction where the slope is steepest
downwards. That's exactly what gradient descent does: it calculates the direction of steepest
ascent (gradient) and takes a step in the opposite direction.
While the core concept is the same, how much data is used to calculate the gradient in each
step leads to different variants:
Batch Gradient Descent (BGD): Calculates the gradient using the entire training
dataset in each iteration. This provides a very accurate gradient but can be
computationally very expensive and slow for big data.
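Stochastic Gradient Descent (SGD): Calculates the gradient and updates parameters
for one training example at a time. Each update is much cheaper than a full-batch update,
and the noise in the updates can help the optimizer escape shallow local minima, but the
loss fluctuates from step to step because single-example gradients have high variance.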
Mini-Batch Gradient Descent: The most common approach. It calculates the gradient
and updates parameters using a small "mini-batch" of training examples (e.g., 32, 64,
128 samples) at a time. This offers a good balance between the computational efficiency
of SGD and the gradient stability of BGD.
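The three variants differ only in how many examples feed each gradient estimate, so one training loop with a `batch_size` parameter can cover all of them. The sketch below fits a single weight w to noiseless data generated from y = 2x; the data, learning rate, and epoch count are illustrative assumptions, and a practical implementation would usually also shuffle the data every epoch.

```python
def gradient(w, batch):
    """Gradient w.r.t. w of the mean of 0.5 * (w*x - y)^2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, learning_rate=0.05, epochs=200):
    """Gradient descent where each update uses `batch_size` examples."""
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= learning_rate * gradient(w, batch)
    return w

# Toy dataset generated from y = 2x (hypothetical, noiseless).
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]

w_batch = train(data, batch_size=len(data))  # Batch GD: one update per epoch
w_sgd = train(data, batch_size=1)            # SGD: one update per example
w_mini = train(data, batch_size=2)           # Mini-batch GD: one update per 2 examples

print(w_batch, w_sgd, w_mini)
```

On this tiny noiseless dataset all three settings recover w close to 2. The trade-offs described above (cost per update versus gradient noise) only become visible on large, noisy datasets.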
How the Backpropagation Algorithm Works? Explain with suitable example. (May-Jun
2024 Q2 c)
Backpropagation essentially applies the chain rule of calculus to compute gradients layer
by layer, starting from the output layer and moving backward towards the input layer. It
determines how much each weight and bias contributed to the overall error (loss) of the
network.
Steps of Backpropagation:
Let's consider a simple FNN with one hidden layer for illustration:
Input Layer (x) -> Hidden Layer (h) -> Output Layer (y_hat)
1. Forward Pass:
Input data (x) is fed into the network.
Activations are computed layer by layer, from input to output.
For the hidden layer: $net_h = x \cdot W_h + b_h$, then $h = f_h(net_h)$
For the output layer: $net_y = h \cdot W_y + b_y$, then $y_{hat} = f_y(net_y)$
Finally, the loss (L) is calculated by comparing $y_{hat}$ with the true target y (e.g., using
Mean Squared Error or Cross-Entropy).
2. Backward Pass (Error Propagation & Gradient Calculation):
Calculate Output Layer Error ($\delta_y$):
First, compute the error derivative with respect to the output layer's activation
and the derivative of the output activation function.
For example, if using MSE loss ($L = \frac{1}{2}(y - y_{hat})^2$) and linear output activation
($y_{hat} = net_y$), then $\frac{\partial L}{\partial y_{hat}} = -(y - y_{hat})$.
The "error signal" for the output layer is typically:
$\delta_y = (y_{hat} - y) \cdot f'_y(net_y)$
(where $f'_y$ is the derivative of the output activation function).
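Calculate Gradients for Output Layer Weights ($W_y$) and Biases ($b_y$):
The gradient of the loss with respect to a weight connecting a hidden neuron $h_k$
to an output neuron $y_j$ is:
$\frac{\partial L}{\partial W_{y,kj}} = \delta_{y_j} \cdot h_k$
The gradient with respect to the bias of output neuron $y_j$ is:
$\frac{\partial L}{\partial b_{y_j}} = \delta_{y_j}$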
The error from the output layer is propagated backward to the hidden layer.
Each hidden neuron receives an error signal that is a weighted sum of the error
signals from the output neurons it connects to.
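$\delta_{h_k} = \left(\sum_j \delta_{y_j} \cdot W_{y,kj}\right) \cdot f'_h(net_{h_k})$
(where $f'_h$ is the derivative of the hidden activation function).
The gradients for the hidden layer weights and biases follow the same pattern, where
$W_{h,ik}$ connects input $x_i$ to hidden neuron $h_k$:
$\frac{\partial L}{\partial W_{h,ik}} = \delta_{h_k} \cdot x_i$
$\frac{\partial L}{\partial b_{h_k}} = \delta_{h_k}$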
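3. Weight Update: Once the gradients ($\frac{\partial L}{\partial W}$ and
$\frac{\partial L}{\partial b}$) are computed, the optimizer (e.g., Gradient Descent) updates
the weights and biases using the learning rate $\alpha$:
$W^{new} = W^{old} - \alpha \cdot \frac{\partial L}{\partial W}$
$b^{new} = b^{old} - \alpha \cdot \frac{\partial L}{\partial b}$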
This process is repeated for many iterations (epochs) over the training data until the
network's performance is satisfactory.
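The backward pass can be checked on the smallest possible network: one input, one hidden neuron with Tanh, and one linear output neuron with the loss L = 0.5 (y - y_hat)^2. The sketch below computes the gradients with the chain rule and compares them against finite-difference estimates. All parameter values and the training pair (x, y) are made-up numbers, and the function names are hypothetical.

```python
import math

def forward(params, x):
    """Forward pass: hidden activation h and prediction y_hat."""
    w_h, b_h, w_y, b_y = params
    h = math.tanh(x * w_h + b_h)      # hidden layer with tanh activation
    y_hat = h * w_y + b_y             # linear output layer
    return h, y_hat

def loss(params, x, y):
    """Squared-error loss L = 0.5 * (y - y_hat)^2."""
    _, y_hat = forward(params, x)
    return 0.5 * (y - y_hat) ** 2

def backprop(params, x, y):
    """Gradients of the loss w.r.t. [w_h, b_h, w_y, b_y] via the chain rule."""
    w_h, b_h, w_y, b_y = params
    h, y_hat = forward(params, x)
    delta_y = y_hat - y                        # linear output, so f'_y = 1
    delta_h = delta_y * w_y * (1.0 - h * h)    # tanh'(net_h) = 1 - h^2
    return [delta_h * x, delta_h, delta_y * h, delta_y]

def numerical_gradient(params, x, y, eps=1e-6):
    """Central finite differences, used only to verify backprop."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads

params = [0.5, -0.2, 0.8, 0.1]   # hypothetical w_h, b_h, w_y, b_y
x, y = 1.5, 1.0                  # one hypothetical training example

analytic = backprop(params, x, y)
numeric = numerical_gradient(params, x, y)

# One gradient descent update with learning rate alpha.
alpha = 0.1
updated = [p - alpha * g for p, g in zip(params, analytic)]
print(loss(params, x, y), loss(updated, x, y))
```

The analytic and numerical gradients agree closely, and a single update with these gradients lowers the loss on this example.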
Imagine a simple network trying to classify if an image contains a cat (output 1) or a dog
(output 0).
1. Forward Pass: You feed a cat image. The network, after calculating weighted sums and
activations through its layers, predicts "0.8" (meaning 80% confident it's a cat). The true
label is 1.
2. Loss Calculation: The loss function (e.g., binary cross-entropy) calculates a value
indicating the difference between 0.8 and 1.0. This error is now the target to minimize.
3. Backward Pass:
Output Layer: The backpropagation algorithm starts at the output layer. It asks:
"How much did this output neuron's weight/bias contribute to the error of 0.8 vs 1.0?"
It calculates the derivatives related to the output neuron's contribution.
Hidden Layer: It then propagates this "error signal" backward to the hidden layer.
For each hidden neuron, it asks: "Based on the error from the output layer, how much
did my weights and biases contribute to that error?" It uses the chain rule to
determine this. For instance, if a hidden neuron was crucial for detecting "ears" and
the cat image was misclassified, the error signal will be strong for the weights
connected to that "ear-detecting" neuron.
This "blame assignment" continues backward until the first hidden layer.
4. Weight Update: Gradient Descent uses these calculated blame signals (gradients) to
slightly adjust all the weights and biases in the network. For example, if the "ear-
detecting" neuron's weights were contributing to a wrong classification, they would be
adjusted to better recognize cat ears in the future.
This iterative process of forward pass (prediction), loss calculation, backward pass (gradient
calculation), and parameter update allows the neural network to learn from its mistakes and
improve its accuracy over time.
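The numbers in this cat/dog example can be worked through directly. The sketch below computes the binary cross-entropy for a prediction of 0.8 against the true label 1, and assumes a sigmoid output neuron, for which the derivative of this loss with respect to the neuron's net input simplifies to (prediction - label).

```python
import math

def binary_cross_entropy(y_true, p):
    """Binary cross-entropy for one example with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

y_true = 1          # the image really is a cat
prediction = 0.8    # network's predicted probability of "cat"

loss_value = binary_cross_entropy(y_true, prediction)

# For a sigmoid output with binary cross-entropy, dL/d(net) = prediction - label.
output_error = prediction - y_true

# A more confident correct prediction gives a smaller loss.
better_loss = binary_cross_entropy(y_true, 0.99)

print(round(loss_value, 4), round(output_error, 4), round(better_loss, 4))
```

The loss is -ln(0.8), roughly 0.223, and the error signal is about -0.2. The negative sign indicates that the output should be pushed upward, toward 1.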
Answer:
What is Recurrent Neural Network? Explain in detail / with example. (May-Jun 2023 Q2
c & May-Jun 2024 Q1 c)
Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed to
process sequential data, unlike traditional Feedforward Neural Networks (FNNs) that
assume inputs are independent of each other. The "recurrent" aspect comes from the fact
that information from previous time steps is fed back into the network, allowing it to exhibit
temporal dynamic behavior and remember past information. This internal memory makes
RNNs particularly well-suited for tasks involving sequences, such as natural language
processing, speech recognition, and time series prediction.
Core Concept:
An RNN has a "memory" in the form of a hidden state that is updated at each time step. The
hidden state ($h_t$) at time t depends not only on the current input ($x_t$) but also on the
hidden state from the previous time step ($h_{t-1}$). This allows the network to maintain
context from earlier elements of the sequence.
At each time step, the recurrent unit receives two inputs:
1. Current Input ($x_t$): The element of the sequence at time t.
2. Previous Hidden State ($h_{t-1}$): The output of the hidden layer from the previous time
step.
These two components are combined, typically multiplied by their respective weights,
summed, and then passed through an activation function (like Tanh or ReLU) to produce the
current hidden state ($h_t$). The hidden state $h_t$ can then be used to calculate the output
($y_t$) at that time step.
The key feature is that the weights associated with the recurrent connections (connecting
$h_{t-1}$ to $h_t$) are shared across all time steps. This means the same set of parameters is
applied to different parts of the sequence, enabling the network to generalize across different
positions in the sequence.
Hidden state at time t:
$h_t = f_h(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$
Output at time t:
$y_t = f_y(W_{hy} h_t + b_y)$
Where:
$x_t$: Input at time t
$h_t$: Hidden state at time t
$y_t$: Output at time t
$W_{hh}$: Weight matrix for the recurrent connection (hidden to hidden)
$W_{xh}$: Weight matrix for input to hidden
$W_{hy}$: Weight matrix for hidden to output
$b_h$, $b_y$: Bias vectors
$f_h$, $f_y$: Activation functions (e.g., Tanh for $f_h$, Softmax for $f_y$ for classification)
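The hidden-state update can be sketched as a single function applied repeatedly with the same weights. The weights, hidden size (2), and input sequence below are hand-picked assumptions, Tanh is used for the hidden activation, and `rnn_step` is a hypothetical helper name.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One time step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    recurrent = matvec(W_hh, h_prev)
    from_input = matvec(W_xh, x_t)
    return [math.tanh(r + i + b) for r, i, b in zip(recurrent, from_input, b_h)]

# Hypothetical parameters: hidden size 2, input size 1.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0], [-1.0]]
b_h = [0.0, 0.0]

# Only the first input is non-zero; later steps receive zeros.
sequence = [[1.0], [0.0], [0.0]]

h = [0.0, 0.0]          # initial hidden state
states = []
for x_t in sequence:
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)   # same weights at every time step
    states.append(h)

print(states)
```

Even though the second and third inputs are zero, the hidden state stays non-zero because it carries a fading trace of the first input. This is the "memory" described above.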
To understand the flow over time, an RNN is often "unrolled" across time steps:
Each [RNN Cell] box represents the same set of weights ($W_{hh}$, $W_{xh}$, $W_{hy}$, $b_h$, $b_y$)
being applied at different time steps.
$h_{-1}$ is the initial hidden state (often initialized to zeros).
1. Memory: Can process sequences of arbitrary length by maintaining a hidden state that
implicitly captures information about prior elements in the sequence.
2. Weight Sharing: Uses the same weights across different time steps, which reduces the
number of parameters and makes the model more efficient for sequence data.
3. Variable Length Inputs/Outputs: Can handle input sequences of varying lengths and
produce output sequences of varying lengths (e.g., many-to-one, one-to-many, many-to-
many).
Challenges with Vanilla RNNs:
Long Short-Term Memory (LSTM) Networks: Introduce "gates" (input, forget, output
gates) and a "cell state" to control the flow of information, enabling them to learn long-
term dependencies much more effectively.
Gated Recurrent Units (GRUs): A simpler variant of LSTMs with fewer gates, offering a
good balance between performance and computational efficiency.
1. Input Sequence: The sentence is tokenized into a sequence of words (or word
embeddings). For example: "This movie was absolutely amazing!"
$x_0$: "This"
$x_1$: "movie"
$x_2$: "was"
$x_3$: "absolutely"
$x_4$: "amazing!"
2. Processing (Many-to-One):
At each time step t, the RNN cell takes the current word embedding ($x_t$) and the
previous hidden state ($h_{t-1}$) and computes the new hidden state ($h_t$).
This continues until the end of the sentence. The final hidden state ($h_4$ for
"amazing!") will ideally encode the overall sentiment of the entire sentence,
remembering the cumulative impact of words like "absolutely" and "amazing!".
3. Output:
After processing the entire sequence, the final hidden state ($h_4$) is fed into a
classification layer (e.g., a softmax layer) that predicts the sentiment: "Positive",
"Negative", or "Neutral".