
Minor in AI

Neural Networks

December 30, 2024



1 Predictive Modelling
Predictive modelling is about creating a mathematical function that predicts the output (or target) from some input. Think of it as learning a function

f(input) = output

where the goal is to make f as accurate as possible for unseen data.

Figure 1: Workflow
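To make this concrete, here is a minimal sketch (not part of the original notes) that learns a simple predictive function f from example input-output pairs; the data values and the choice of a straight-line model are assumptions made purely for illustration.

```python
# A minimal sketch of predictive modelling: learn a function f that maps an
# input to an output from example pairs. Data values are invented.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # inputs
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # noisy targets, roughly y = 2x

# Fit a straight line f(x) = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)

def f(x_new):
    """The learned predictive function."""
    return a * x_new + b

print(f(6.0))   # prediction for an unseen input; should be close to 12
```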

2 Machine Learning (ML)


Machine Learning is a subset of artificial intelligence (AI) that teaches computers to
learn patterns from data without being explicitly programmed.

2.1 How Does It Work?


1. Given labeled data (e.g., images of animals with the correct labels "dog" or "cat"), ML algorithms learn to map inputs to outputs using mathematical models.
2. Common algorithms: Decision Trees, Support Vector Machines (SVM), k-Nearest Neighbors (kNN), and Random Forests.

3 Deep Learning (DL)


Deep Learning is a specialized subset of Machine Learning that uses neural networks with multiple layers (hence "deep") to learn patterns directly from raw data. Unlike traditional ML, DL automatically extracts relevant features.

3.1 How Does It Work?


1. Input (e.g., raw images) passes through multiple layers of neurons that process the
data and identify patterns (e.g., recognizing ears, eyes, and tails).


2. Deep Learning is particularly effective when dealing with unstructured data like
images, audio, and text.

Figure 2: AI Sphere

4 Dog and Cat Classification


4.1 Problem Statement
In this problem, the goal is to develop a machine learning or deep learning model capable
of accurately classifying images of animals into one of two categories: dog or cat. Given
an input image of an animal, the model should analyze the visual features in the image
and predict whether the animal is a dog or a cat.

Figure 3: Dog vs Cat


4.2 Challenges
This task is challenging due to:

• Similar Features: Dogs and cats may share similar features such as fur patterns, shapes, or poses, making it difficult for a model to distinguish between them.

• Diverse Conditions: Images can vary widely in terms of lighting, background, angles, and the sizes of the animals.

4.3 Objective
The objective is to build a classification model that takes an image as input and outputs
a prediction (dog or cat) with high accuracy, even when faced with unseen or diverse
images.

4.4 ML Approach
• In ML, we extract features manually (e.g., the shape of ears, tail, or color patterns).

• These features are then used to train a model (e.g., SVM) to classify the animal as
a dog or a cat.

Figure 4: ML Workflow
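A minimal sketch of this ML approach, assuming scikit-learn is available; the handcrafted feature values (ear-shape score, tail length, fur-colour index) and labels below are hypothetical stand-ins for measurements extracted from images.

```python
# A minimal sketch: train an SVM on manually extracted features.
from sklearn.svm import SVC

X = [
    [0.9, 0.3, 0.2],   # pointy ears, short tail  -> cat (hypothetical features)
    [0.8, 0.4, 0.3],   # cat-like
    [0.2, 0.9, 0.7],   # floppy ears, long tail   -> dog
    [0.3, 0.8, 0.6],   # dog-like
]
y = [0, 0, 1, 1]       # 0 = cat, 1 = dog

clf = SVC(kernel="rbf")
clf.fit(X, y)                               # learn from the handcrafted features
print(clf.predict([[0.85, 0.35, 0.25]]))    # expected to print [0], i.e. "cat"
```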

4.5 DL Approach
• A DL model such as a Convolutional Neural Network (CNN) can process raw image pixels, learn to identify unique features (e.g., fur texture or the shape of the face), and classify the image as "dog" or "cat."


Figure 5: DL Workflow
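A minimal sketch of the DL approach, assuming TensorFlow/Keras is installed; the 64x64 RGB input size, layer widths, and training call are illustrative choices rather than values from the notes.

```python
# A small CNN sketch that maps raw pixels to a dog-vs-cat probability.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),           # raw RGB pixels (assumed size)
    layers.Conv2D(16, 3, activation="relu"),   # low-level features: edges, textures
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),   # higher-level parts: ears, eyes, tails
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),     # probability of "dog" vs "cat"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)   # hypothetical labelled image arrays
```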

5 Features and Their Importance


5.1 What are Features?
Features are individual measurable properties or characteristics of the data used for
prediction.
In the dog vs. cat problem, features could be the length of ears, shape of the nose, or
pixel intensity values.

5.2 Why are Features Important?


1. Good features make it easier for a model to distinguish between classes.

2. In ML, feature engineering is critical: manually selecting, extracting, or transforming features from the data.

5.3 Feature Learning


Feature learning is a critical component of this classification problem, as it involves identifying and extracting patterns or characteristics from the images that are most relevant for distinguishing between dogs and cats.

5.3.1 Key Aspects


• Manual Feature Extraction: Traditional machine learning models, such as support vector machines (SVMs), often rely on handcrafted features, such as edge detection or color histograms. These features are manually designed based on domain expertise.

• Automatic Feature Extraction: Deep learning models, especially convolutional neural networks (CNNs), automatically learn hierarchical features from the raw data during training.


– Lower layers learn simple features, such as edges and textures.


– Higher layers learn more abstract features, such as shapes, eyes, ears, or tails,
which are specific to dogs or cats.

• Feature Importance: Not all features contribute equally to the classification. For
example:

– The presence of pointy ears might be more relevant for identifying cats.
– A broader face or specific fur patterns might indicate a dog.

Figure 6: Feature Learning

Benefits
• Reduced Dependence on Domain Knowledge: Unlike traditional models, CNNs can automatically learn relevant features, reducing the need for manual intervention.

• Scalability: Feature learning in deep learning is highly scalable to large and diverse datasets.

• Improved Accuracy: Automatically learned features are often more effective in capturing subtle patterns, leading to higher classification accuracy.

6 Modalities and Their Integration

Figure 7: Modalities


6.1 What are Modalities?


• Modalities refer to different types of data sources. For example:

– Images (e.g., dog and cat photos).


– Text (e.g., a description of the animal).
– Audio (e.g., the sound made by the animal).

6.2 Integration of Modalities


• Combining different modalities can improve prediction.

• Example: For classifying animals, combining image data with text descriptions
could provide additional context to improve accuracy.

7 ML vs. DL

Figure 8: ML vs DL

7.1 Key Differences


• Feature Extraction:

– ML requires manual feature engineering.


– DL automatically extracts features.

• Data Dependency:

– ML works well with small to medium-sized datasets.


– DL needs large datasets to perform effectively.

• Complexity:

– ML algorithms are simpler and faster to train.


– DL models are computationally intensive and require GPUs.


7.2 Key Takeaway


1. Use ML for simpler problems or when data is limited.

2. Use DL for complex problems, especially with large datasets and unstructured data
like images or audio.


8 Architecture of Neural Network

Figure 9: Deep Neural Network Architecture

Acyclic, feed-forward network of perceptrons with non-linear activation functions
• A neural network is made up of layers:
– The input layer receives the data.
– Hidden layers process the data by learning patterns.
– The output layer gives the final result.
• Each layer is connected to the next one using lines (called connections), which pass
information forward. Note that this is an Acyclic Network.
• Each connection is adjusted during learning to make the predictions better.
• The goal of training the network is to make accurate predictions for new data.
• After training, we check how well the network works using a separate test dataset.
Objective: Given a training set S, we train a neural network to achieve the least generalization error. We estimate the generalization error using a test set T.

Predicting Exam Scores


We want to predict how well students perform in an exam based on:

• Hours of Sleep: The number of hours a student sleeps before the exam.

• Hours of Study: The number of hours a student studies for the exam.

The goal is to estimate the exam score for students using these two inputs with the
help of a neural network.

The neural network in the diagram consists of:


Figure 10: NN for Exam Score Prediction

• Input Layer: Two inputs (hours of sleep and hours of study).


• Hidden Layer: Two neurons that process the inputs using weights, biases, and an
activation function.
• Output Layer: A single neuron that gives the final prediction (estimated exam
score).

8.1 Notations
• The network consists of L layers labeled as {1, 2, . . . , L}.

• Input Layer: The inputs are connected to the first hidden layer, denoted as
layer 1.

• Output Layer: The final layer, denoted as layer L, is the output layer.

• Each layer ℓ has $m_\ell$ nodes, labeled as $\{1, 2, \ldots, m_\ell\}$.

• Node k in layer ℓ has:

– a bias term $b_k^{(\ell)}$,
– a net (pre-activation) value $z_k^{(\ell)}$, and
– an activation (output) value $a_k^{(\ell)}$.

• The weight on the edge from node j in layer ℓ − 1 to node k in layer ℓ is denoted as $w_{kj}^{(\ell)}$.


8.2 Forward Pass: Numerical Computation in Simple Terms

In a neural network, each node in a layer performs two main steps during a forward pass:

1. Combine Inputs: Multiply each input by a weight, add them together, and include a bias term. Think of this as finding the "weighted importance" of all the inputs, plus a baseline adjustment.

2. Apply Activation Function: Take the combined value from step 1 and
pass it through an activation function (e.g., squashing it between 0 and 1 for
a sigmoid function). This gives the node’s output.

Example: Suppose we have three inputs to a node with values $a_1 = 2$, $a_2 = 3$, $a_3 = 1$, weights $w_1 = 0.5$, $w_2 = 0.2$, $w_3 = 0.8$, and bias $b = 0.1$.

1. Compute the weighted sum:
$$z = (0.5 \times 2) + (0.2 \times 3) + (0.8 \times 1) + 0.1 = 1 + 0.6 + 0.8 + 0.1 = 2.5$$

2. Apply an activation function: Using a sigmoid activation function, the output becomes:
$$a = \frac{1}{1 + e^{-2.5}} \approx 0.92$$
The final output of the node is 0.92.
This process is repeated for every node in every layer during the forward pass.
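The same two steps can be checked in a few lines of Python (a sketch using NumPy, reproducing the numbers above):

```python
# Verify the worked example: weighted sum plus bias, then a sigmoid activation.
import numpy as np

a_in = np.array([2.0, 3.0, 1.0])       # inputs a1, a2, a3
w = np.array([0.5, 0.2, 0.8])          # weights w1, w2, w3
b = 0.1                                # bias

z = np.dot(w, a_in) + b                # weighted sum: 2.5
a_out = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation: ~0.92

print(z, a_out)
```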

Step-by-Step Workflow
1. Inputs: The inputs to the network are:
$$X = \begin{bmatrix} 3 & 5 \\ 5 & 1 \\ 10 & 2 \end{bmatrix}$$
Each row represents the hours of sleep and hours of study for a student.

2. Hidden Layer Calculations: In the hidden layer, each neuron processes inputs from the previous layer. To calculate the input to a neuron, each input value is multiplied by a corresponding weight, summed together, and then a bias term is added. This produces the pre-activation value:
$$z_j^{(2)} = \sum_{i=1}^{2} w_{ij}^{(1)} x_i + b_j^{(1)}$$
After calculating the pre-activation value, a non-linear activation function f(·) is applied to this value to produce the neuron's output:
$$a_j^{(2)} = f\!\left(z_j^{(2)}\right)$$


3. Output Layer Calculations: In the output layer, the outputs from the hidden layer are processed similarly. Each output from the hidden layer is multiplied by a new set of weights, summed together, and a bias term is added. This produces the pre-activation value for the output layer:
$$z^{(3)} = \sum_{j=1}^{2} w_j^{(2)} a_j^{(2)} + b^{(2)}$$
The pre-activation value is then passed through an activation function f(·) to produce the final prediction:
$$\hat{y} = f\!\left(z^{(3)}\right)$$

4. Output: The final value, ŷ, represents the predicted exam score, which is the
output of the network.
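The whole workflow can be sketched in NumPy as below. Since the notes do not give trained parameter values, the weights here are random placeholders and the sigmoid output is a normalized score in (0, 1); the snippet is only meant to show the shape of the computation.

```python
# Forward pass for the exam-score network: 2 inputs, 2 hidden neurons, 1 output.
import numpy as np

X = np.array([[3.0, 5.0],
              [5.0, 1.0],
              [10.0, 2.0]])            # rows: (hours of sleep, hours of study)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))           # weights: input -> hidden (placeholders)
b1 = np.zeros(2)                       # hidden-layer biases
W2 = rng.normal(size=(2, 1))           # weights: hidden -> output
b2 = np.zeros(1)                       # output bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Z2 = X @ W1 + b1                       # pre-activations of the hidden layer
A2 = sigmoid(Z2)                       # hidden-layer activations
Z3 = A2 @ W2 + b2                      # pre-activation of the output neuron
y_hat = sigmoid(Z3)                    # predicted (normalized) exam scores

print(y_hat.ravel())                   # one prediction per student
```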

9 Learning of Parameters

Figure 11: Forward Pass and Backpropagation

9.1 Objective
The primary goal is to find the optimal values for all the weights $w_{kj}^{(\ell)}$ and biases $b_j^{(\ell)}$ in a neural network. This is achieved using the gradient descent algorithm, which minimizes a defined cost function, denoted as C. The process involves calculating partial derivatives of C with respect to the weights and biases:
$$\frac{\partial C}{\partial w_{kj}^{(\ell)}} \quad \text{and} \quad \frac{\partial C}{\partial b_j^{(\ell)}}.$$


9.2 Assumptions about the Cost Function C(x)


1. For a single input x, the cost function C(x) depends only on the activation value $a_j^{(L)}$ in the output layer L. Specifically, for a training input $(x_i, y_i)$, the squared error is the squared difference between the true value $y_i$ and the predicted output $a_j^{(L)}$:
$$C(x_i) = \left(y_i - a_j^{(L)}\right)^2,$$
where $x_i$ and $y_i$ are fixed values (input and target output), and $a_j^{(L)}$ is the output activation of the network's output layer (a variable).

2. The total cost C is the average cost across all training samples. It is calculated by summing the individual costs $C(x_i)$ for each training example and dividing by the total number of examples n:
$$C = \frac{1}{n} \sum_{i=1}^{n} C(x_i),$$
where n is the total number of training samples. For example, in the case of the mean sum-squared error, the total cost is:
$$C = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - a_j^{(L)}\right)^2.$$

9.3 Perturbation Analysis and Backpropagation


To optimize the weights in the network, we need to understand how changes in the input to each activation function affect the overall cost C. This is done by computing the derivative of the cost with respect to the pre-activation value $z_j^{(\ell)}$ at each node j in layer ℓ:
$$\frac{\partial C}{\partial z_j^{(\ell)}}.$$
Using these gradients, we can then calculate the gradients with respect to the weights and biases:
$$\frac{\partial C}{\partial w_{kj}^{(\ell)}} \quad \text{and} \quad \frac{\partial C}{\partial b_j^{(\ell)}}.$$

9.4 Forward Pass


The forward pass refers to the process of computing the outputs of all layers of the neural network, starting from the input layer and proceeding through each subsequent layer until the output layer is reached. In this process, each layer takes the output from the previous layer, applies weights, biases, and an activation function, and produces the output for the next layer.

Every hidden and output node has two values: a net value (z) and an output value (a).


Figure 12: Perturbation Analysis

(a) Start with the input vector $x = [x_1, x_2, x_3, \ldots]$, which contains the values of the input features.

(b) For the first hidden layer (Layer 1), calculate the weighted sum of the inputs, adding a bias term. This is done for each node k in Layer 1:
$$z_k^{(1)} = \sum_j w_{kj}^{(1)} x_j + b_k^{(1)},$$
where:
• $z_k^{(1)}$ is the weighted sum for node k in Layer 1 (the pre-activation value),
• $w_{kj}^{(1)}$ is the weight connecting input $x_j$ to node k in Layer 1,
• $b_k^{(1)}$ is the bias term for node k in Layer 1.

(c) Apply an activation function f(·) (e.g., sigmoid, ReLU) to the pre-activation value $z_k^{(1)}$ to compute the activation (output) of the node:
$$a_k^{(1)} = f\!\left(z_k^{(1)}\right),$$
where $a_k^{(1)}$ is the output (activation) of node k in Layer 1.

(d) Repeat this process for each subsequent layer, passing the activations from the current layer as inputs to the next layer. The process continues until the output layer (Layer L) is reached. For the output layer, the predicted value $\hat{y}$ is the output of the last node k in Layer L:
$$\hat{y} = a_k^{(L)},$$
where $\hat{y}$ is the predicted output of the network.

9.5 Backpropagation
The process of backpropagation involves the calculation of gradients for updating the
weights in a neural network. The diagram in Figure 14 illustrates the computation
of gradients step-by-step:

Local Gradients: $\dfrac{\partial z}{\partial x}, \ \dfrac{\partial z}{\partial y}$


Figure 13: Backward Pass w.r.t each parameter

Propagation of Gradients:

– The gradient with respect to x is given by:
$$\frac{\partial C}{\partial x} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial x}$$

– Similarly, the gradient with respect to y is given by:
$$\frac{\partial C}{\partial y} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial y}$$

Figure 14: How does gradient propagate?

• $\dfrac{\partial C}{\partial z}$: the gradient of the cost function C with respect to the output z of the function f.
• $\dfrac{\partial z}{\partial x}$: the local gradient, representing how z changes with x.
• $\dfrac{\partial z}{\partial y}$: the local gradient, representing how z changes with y.

These relationships showcase how the chain rule is applied in backpropagation to compute the gradients of the cost C with respect to each input variable.


Using the chain rule, we can compute the gradients by working backwards from the output layer to the earlier layers. Specifically, the gradient of the cost function C with respect to the pre-activation value $z_j^{(\ell)}$ at node j in layer ℓ is given by the following formula.

First, calculate the contribution of each node k in the next layer (ℓ + 1) to the gradient at node j in the current layer. This is done by multiplying the gradient $\frac{\partial C}{\partial z_k^{(\ell+1)}}$ of the cost with respect to the pre-activation value at node k in layer (ℓ + 1) by the weight $w_{kj}^{(\ell+1)}$ that connects node k in layer (ℓ + 1) to node j in layer ℓ. We then multiply by the derivative of the activation function $f'(z_j^{(\ell)})$ at node j in layer ℓ. The final expression for the gradient is:
$$\frac{\partial C}{\partial z_j^{(\ell)}} = \sum_k \frac{\partial C}{\partial z_k^{(\ell+1)}} \, w_{kj}^{(\ell+1)} \, f'\!\left(z_j^{(\ell)}\right),$$
where $f'(z_j^{(\ell)})$ is the derivative of the activation function applied at node j in layer ℓ.

Figure 15: Gradients add at branches!

Weight and Bias Updates: Once we have computed the gradients, we update the weights and biases using gradient descent. The weights $w_{kj}^{(\ell)}$ and biases $b_j^{(\ell)}$ are adjusted in the direction that reduces the cost function C. The updates are performed as follows.

To update the weight $w_{kj}^{(\ell)}$, subtract the product of the learning rate η and the gradient of the cost with respect to the weight:
$$w_{kj}^{(\ell)} \leftarrow w_{kj}^{(\ell)} - \eta \frac{\partial C}{\partial w_{kj}^{(\ell)}}.$$

Similarly, to update the bias $b_j^{(\ell)}$, subtract the product of the learning rate η and the gradient of the cost with respect to the bias:
$$b_j^{(\ell)} \leftarrow b_j^{(\ell)} - \eta \frac{\partial C}{\partial b_j^{(\ell)}},$$


where η is the learning rate, which determines the size of the step taken in the
direction of the negative gradient.
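The sketch below puts the forward pass, the backward pass, and these update rules together for the small exam-score network. The targets, learning rate, and initialization are assumed values for illustration, and the cost is the mean squared error defined earlier.

```python
# One small network trained by repeated gradient-descent steps (illustrative sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)      # input -> hidden
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)      # hidden -> output

X = np.array([[3.0, 5.0], [5.0, 1.0], [10.0, 2.0]])
X = X / X.max(axis=0)                    # scale inputs to [0, 1] (a modelling choice)
y = np.array([[0.75], [0.82], [0.93]])   # hypothetical normalized exam scores
eta = 0.5                                # learning rate
n = X.shape[0]

for step in range(2000):
    # Forward pass
    Z2 = X @ W1 + b1
    A2 = sigmoid(Z2)
    Z3 = A2 @ W2 + b2
    y_hat = sigmoid(Z3)

    # Backward pass (chain rule) for the mean squared-error cost
    dZ3 = (2.0 / n) * (y_hat - y) * y_hat * (1 - y_hat)   # dC/dz at the output node
    dW2, db2 = A2.T @ dZ3, dZ3.sum(axis=0)
    dZ2 = (dZ3 @ W2.T) * A2 * (1 - A2)                    # gradient routed to hidden layer
    dW1, db1 = X.T @ dZ2, dZ2.sum(axis=0)

    # Gradient-descent updates: w <- w - eta * dC/dw, b <- b - eta * dC/db
    W1, b1 = W1 - eta * dW1, b1 - eta * db1
    W2, b2 = W2 - eta * dW2, b2 - eta * db2

print(np.round(y_hat, 3))   # predictions move toward the targets as the cost falls
```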

9.6 Key Takeaways


By combining forward pass and backpropagation:

• The forward pass computes activations layer by layer, from input to output.
• The backward pass calculates the gradients of the cost function with respect
to the weights and biases, propagating errors backward through the network.
• Finally, weights and biases are updated iteratively to minimize the cost
function and optimize the model.

This process ensures that the neural network learns to make accurate predictions
by reducing the error over time.

10 A Simple Example

Figure 16: Backpropagation : An Example

We aim to calculate the partial derivatives of the function f (x, y, z) with respect
to x, y, and z, using the chain rule.

10.1 Function Definition


The given function is:
$$f(x, y, z) = (x + y)z$$
For example, let:
$$x = -2, \quad y = 5, \quad z = -4$$


10.2 Intermediate Variable


Define q as:
$$q = x + y$$
Thus, we compute:
$$q = -2 + 5 = 3$$
The partial derivatives of q are:
$$\frac{\partial q}{\partial x} = 1, \qquad \frac{\partial q}{\partial y} = 1$$

10.3 Function Simplification


The function f can now be expressed in terms of q and z:
$$f = qz$$
The partial derivatives of f are:
$$\frac{\partial f}{\partial q} = z, \qquad \frac{\partial f}{\partial z} = q$$
For the given values of z and q:
$$\frac{\partial f}{\partial q} = -4, \qquad \frac{\partial f}{\partial z} = 3$$

10.4 Chain Rule


To compute the desired derivatives, we apply the chain rule:
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x}, \qquad \frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial y}$$
Substituting the values:
$$\frac{\partial f}{\partial x} = (-4) \cdot 1 = -4, \qquad \frac{\partial f}{\partial y} = (-4) \cdot 1 = -4$$
Additionally:
$$\frac{\partial f}{\partial z} = q = 3$$
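The same worked example, written as a tiny forward and backward pass in Python:

```python
# Forward and backward pass through the computational graph of f(x, y, z) = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (local gradients combined with the chain rule)
df_dq = z          # df/dq = z = -4
df_dz = q          # df/dz = q = 3
df_dx = df_dq * 1  # df/dx = df/dq * dq/dx = -4
df_dy = df_dq * 1  # df/dy = df/dq * dq/dy = -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```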

11 Another Example
We aim to compute the gradient of the function f(w, x), which is defined as:
$$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$
The weights and inputs are:
$$w_0 = 2.00, \quad x_0 = -1.00, \quad w_1 = -3.00, \quad x_1 = -2.00, \quad w_2 = -3.00$$


Figure 17: Backpropagation : Another Example

11.1 Forward Pass - Computation


(a) Weighted Inputs:
$$w_0 \cdot x_0 = 2.00 \cdot (-1.00) = -2.00$$
$$w_1 \cdot x_1 = -3.00 \cdot (-2.00) = 6.00$$

(b) Sum of Inputs:
$$-2.00 + 6.00 + (-3.00) = 4.00 + (-3.00) = 1.00$$

(c) Exponential and Denominator:
$$\text{Exponential term: } e^{-1.00} \approx 0.37$$
$$\text{Denominator: } 1 + e^{-1.00} \approx 1 + 0.37 = 1.37$$

(d) Final Output:
$$f(w, x) = \frac{1}{1.37} \approx 0.73$$

Derivative Rules
Key derivative rules used in this computation are:
$$f(x) = e^x \;\rightarrow\; \frac{df}{dx} = e^x,$$
$$f(x) = \frac{1}{x} \;\rightarrow\; \frac{df}{dx} = -\frac{1}{x^2},$$
$$f_a(x) = ax \;\rightarrow\; \frac{df}{dx} = a,$$
$$f_c(x) = c + x \;\rightarrow\; \frac{df}{dx} = 1.$$


11.2 Backpropagation - Computation


We aim to compute:
$$\frac{\partial f}{\partial w_0}, \; \frac{\partial f}{\partial w_1}, \; \frac{\partial f}{\partial w_2}, \; \frac{\partial f}{\partial x_0}, \; \frac{\partial f}{\partial x_1}.$$

Step 1: Derivative at the Output

The output of the sigmoid function is:
$$f = \frac{1}{1 + e^{-z}}, \qquad z = w_0 x_0 + w_1 x_1 + w_2.$$
The derivative of f with respect to z is:
$$\frac{\partial f}{\partial z} = f(1 - f) = 0.73 \cdot (1 - 0.73) = 0.73 \cdot 0.27 = 0.1971.$$

Step 2: Gradients with Respect to $w_0$, $w_1$, $w_2$

The gradient flows backward through $z = w_0 x_0 + w_1 x_1 + w_2$. The partial derivatives are:
$$\frac{\partial z}{\partial w_0} = x_0, \qquad \frac{\partial z}{\partial w_1} = x_1, \qquad \frac{\partial z}{\partial w_2} = 1.$$
Using the chain rule:
$$\frac{\partial f}{\partial w_0} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial w_0} = 0.1971 \cdot (-1.00) = -0.1971,$$
$$\frac{\partial f}{\partial w_1} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial w_1} = 0.1971 \cdot (-2.00) = -0.3942,$$
$$\frac{\partial f}{\partial w_2} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial w_2} = 0.1971 \cdot 1 = 0.1971.$$

Step 3: Gradients with Respect to $x_0$, $x_1$

The gradients flow through the weighted connections to $x_0$ and $x_1$. The partial derivatives are:
$$\frac{\partial z}{\partial x_0} = w_0, \qquad \frac{\partial z}{\partial x_1} = w_1.$$
Using the chain rule:
$$\frac{\partial f}{\partial x_0} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial x_0} = 0.1971 \cdot 2.00 = 0.3942,$$
$$\frac{\partial f}{\partial x_1} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial x_1} = 0.1971 \cdot (-3.00) = -0.5913.$$


Final Results

The gradients are:
$$\frac{\partial f}{\partial w_0} = -0.1971, \qquad \frac{\partial f}{\partial w_1} = -0.3942, \qquad \frac{\partial f}{\partial w_2} = 0.1971,$$
$$\frac{\partial f}{\partial x_0} = 0.3942, \qquad \frac{\partial f}{\partial x_1} = -0.5913.$$
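A short Python sketch reproducing these numbers (note that the notes round f to 0.73, which gives 0.1971; using the unrounded f = 0.7311 gives 0.1966):

```python
# Gradients of the sigmoid neuron f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))).
import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass
z = w0 * x0 + w1 * x1 + w2          # 1.0
f = 1.0 / (1.0 + math.exp(-z))      # ~0.73

# Backward pass
df_dz = f * (1 - f)                 # ~0.1966 (0.1971 when f is rounded to 0.73)
grads = {
    "w0": df_dz * x0,   # ~ -0.20
    "w1": df_dz * x1,   # ~ -0.39
    "w2": df_dz * 1.0,  # ~  0.20
    "x0": df_dz * w0,   # ~  0.39
    "x1": df_dz * w1,   # ~ -0.59
}
print(round(f, 2), {k: round(v, 4) for k, v in grads.items()})
```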

Figure 18: Solution

Figure 19: Sigmoid Function

Sigmoid Function
The sigmoid function is a widely used activation function in machine learning, particularly in neural networks. Mathematically, it is defined as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
This function maps any real-valued number to the range (0, 1), which is useful for representing probabilities or normalized outputs.


Properties of the Sigmoid Function


(a) Range: The output of σ(x) is always between 0 and 1, i.e.,

σ(x) ∈ (0, 1) ∀x ∈ R.

(b) Asymptotic Behavior: As x → ∞, σ(x) → 1. Similarly, as x → −∞, σ(x) → 0.

(c) Interpretation: The sigmoid function is often used to model probabilities because it outputs values in the range (0, 1). Larger x corresponds to a higher probability, and smaller x corresponds to a lower probability.

Derivative
The derivative of the sigmoid function is essential for optimization algorithms like gradient descent. Let us compute it:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Let $y = \sigma(x)$. Then:
$$y = \frac{1}{1 + e^{-x}}$$
Taking the derivative with respect to x, we apply the chain rule:
$$\frac{d\sigma(x)}{dx} = \frac{d}{dx}\left(\frac{1}{1 + e^{-x}}\right)$$
First, rewrite σ(x) as:
$$\sigma(x) = (1 + e^{-x})^{-1}$$
Using the chain rule:
$$\frac{d\sigma(x)}{dx} = -1 \cdot (1 + e^{-x})^{-2} \cdot \frac{d}{dx}(1 + e^{-x})$$
The derivative of $1 + e^{-x}$ is:
$$\frac{d}{dx}(1 + e^{-x}) = -e^{-x}$$
Substituting this back:
$$\frac{d\sigma(x)}{dx} = -1 \cdot (1 + e^{-x})^{-2} \cdot (-e^{-x})$$
Simplify:
$$\frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}$$
Now, recall that:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
So:
$$1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$$
Substituting σ(x) and 1 − σ(x) back into the derivative:
$$\frac{d\sigma(x)}{dx} = \sigma(x) \cdot (1 - \sigma(x))$$
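A quick numerical check of this identity (a sketch comparing the analytic derivative with a central finite difference):

```python
# Check sigma'(x) = sigma(x) * (1 - sigma(x)) against a finite-difference estimate.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
h = 1e-6
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)   # central difference

print(sigmoid_grad(xs))   # analytic: maximum 0.25 at x = 0, near 0 at |x| = 5
print(numeric)            # matches the analytic values closely
```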

Interpretation
• The derivative of the sigmoid function reaches its maximum value when
σ(x) = 0.5, which occurs at x = 0.

• For very large or very small values of x, the derivative approaches 0, indicating that the sigmoid function saturates and gradient updates become small. This is known as the vanishing gradient problem.

Key Takeaway
The sigmoid function is smooth and differentiable, making it useful for backpropagation in neural networks. However, its derivative shows that it can cause slow convergence for extreme values of x due to small gradients.

12 Patterns in Backward Flow of Gradients


When performing backpropagation in a neural network, gradients flow backward
through the network. Each type of operation (or gate) has a specific way of
affecting the flow of gradients. Below are three common types of gates and their
roles in gradient propagation:

Figure 20: Backward flow patterns


12.1 Add Gate: Gradient Distributor


• The add gate combines two or more inputs by summing them:
$$z = x_1 + x_2.$$
• In backpropagation, the gradient of the cost function C with respect to z is distributed equally to the inputs:
$$\frac{\partial C}{\partial x_1} = \frac{\partial C}{\partial z}, \qquad \frac{\partial C}{\partial x_2} = \frac{\partial C}{\partial z}.$$
• Role: The add gate acts as a gradient distributor, splitting the gradient and sending it to all inputs.

12.2 Max Gate: Gradient Router


• The max gate selects the maximum value from two inputs:
$$z = \max(x_1, x_2).$$
• In backpropagation, the gradient of the cost function C with respect to z is routed only to the input that contributed the maximum value:
$$\frac{\partial C}{\partial x_1} = \begin{cases} \frac{\partial C}{\partial z}, & \text{if } x_1 > x_2, \\ 0, & \text{otherwise}, \end{cases} \qquad \frac{\partial C}{\partial x_2} = \begin{cases} \frac{\partial C}{\partial z}, & \text{if } x_2 > x_1, \\ 0, & \text{otherwise}. \end{cases}$$
• Role: The max gate acts as a gradient router, directing the gradient to the input with the highest value.

12.3 Multiply (Mul) Gate: Gradient Switcher


• The multiply gate multiplies two inputs:
$$z = x_1 \cdot x_2.$$
• In backpropagation, the gradient of the cost function C with respect to each input is scaled by the value of the other input:
$$\frac{\partial C}{\partial x_1} = \frac{\partial C}{\partial z} \cdot x_2, \qquad \frac{\partial C}{\partial x_2} = \frac{\partial C}{\partial z} \cdot x_1.$$
• Role: The multiply gate acts as a gradient switcher, scaling the gradient for one input by the value of the other input.

Summary of Patterns
1. Add Gate: Distributes gradients equally to all inputs.
2. Max Gate: Routes gradients only to the input that contributed the maximum
value.
3. Mul Gate: Switches or scales gradients between inputs based on their values.

These patterns help to understand how gradients flow through different parts of a
neural network, enabling efficient backpropagation.
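A small sketch of the three patterns, with the upstream gradient ∂C/∂z set to 1 so the local behaviour is easy to read off:

```python
# Gradient behaviour of add, max, and multiply gates for two example inputs.
x1, x2 = 3.0, 5.0
dC_dz = 1.0

# Add gate: z = x1 + x2 -> the gradient is distributed unchanged to both inputs
add_grads = (dC_dz * 1.0, dC_dz * 1.0)            # (1.0, 1.0)

# Max gate: z = max(x1, x2) -> the gradient is routed to the larger input only
max_grads = (dC_dz if x1 > x2 else 0.0,
             dC_dz if x2 > x1 else 0.0)           # (0.0, 1.0)

# Mul gate: z = x1 * x2 -> each input's gradient is scaled by the *other* input
mul_grads = (dC_dz * x2, dC_dz * x1)              # (5.0, 3.0)

print(add_grads, max_grads, mul_grads)
```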


13 Food for Thought

Figure 21: Left for You... Explore & Have Fun!!!


14 Additional Resources
[1] Fei-Fei Li, Lecture 4: Backpropagation and Neural Networks, Stanford University.
[2] Michael Nielsen, Neural Networks and Deep Learning, Chapter 2.
[3] Phillip Isola, Using and Understanding Deep Neural Nets, Tutorial Series, MIT.
[4] M. Shifas PV, Introduction to Neural Networks and Application in Speech Enhancement, Department of CSE, University of Crete.

14.1 Partial Derivatives


What are Partial Derivatives?

• Partial derivatives represent the rate of change of a function with respect to one variable while keeping the other variables constant. For example, if f(x, y) is a function of x and y, $\frac{\partial f}{\partial x}$ measures how f changes as x changes, holding y fixed.

• Visual Representation: Imagine a landscape where height represents the function's value. Moving along the x direction gives $\frac{\partial f}{\partial x}$, and moving along y gives $\frac{\partial f}{\partial y}$.

Figure 22: Geometry of Partial Derivatives

Interpretation and Real-World Context

• For a function f(x, y, z):
– $\frac{\partial f}{\partial x}$ indicates how f changes as x changes, with y and z held constant.
– $\frac{\partial f}{\partial y}$ and $\frac{\partial f}{\partial z}$ follow similarly.

• In economics, if f(x, y) represents profit as a function of production levels x and y, $\frac{\partial f}{\partial x}$ shows the change in profit when increasing x, keeping y constant.

Mathematical Definition

For a multivariable function $f(x_1, x_2, \ldots, x_n)$, the partial derivative with respect to $x_k$ is:
$$\frac{\partial f}{\partial x_k} = \lim_{\Delta x_k \to 0} \frac{f(x_1, \ldots, x_k + \Delta x_k, \ldots, x_n) - f(x_1, \ldots, x_k, \ldots, x_n)}{\Delta x_k}.$$
This measures the instantaneous rate of change of f with respect to $x_k$.

Gradient and Applications

The gradient ∇f of $f(x_1, x_2, \ldots, x_n)$ is:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}.$$
It points in the direction of steepest increase and is crucial in optimization and machine learning.

Examples

(a) Function Example: For $f(x, y) = x^2 + 3xy + y^2$, the partial derivatives are:
$$\frac{\partial f}{\partial x} = 2x + 3y, \qquad \frac{\partial f}{\partial y} = 3x + 2y.$$

(b) Neural Networks: In gradient descent, the partial derivatives $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ are used to update the weights w and biases b:
$$w \leftarrow w - \eta \frac{\partial C}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial C}{\partial b}.$$
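A short sketch verifying example (a) numerically by comparing the analytic partial derivatives with finite differences at an arbitrary point:

```python
# Analytic vs. numerical partial derivatives of f(x, y) = x^2 + 3xy + y^2.
def f(x, y):
    return x**2 + 3 * x * y + y**2

def df_dx(x, y):
    return 2 * x + 3 * y

def df_dy(x, y):
    return 3 * x + 2 * y

x, y, h = 1.0, 2.0, 1e-6
num_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # vary x, hold y fixed
num_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)   # vary y, hold x fixed

print(df_dx(x, y), num_dx)   # 8.0 vs ~8.0
print(df_dy(x, y), num_dy)   # 7.0 vs ~7.0
```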

Geometric Interpretation

• A partial derivative $\frac{\partial f}{\partial x}$ can be visualized as the slope of the tangent line to the curve obtained by fixing all variables except x.
• Similarly, $\frac{\partial f}{\partial y}$ is the slope in the y direction.


Key Takeaways

• Partial derivatives measure the sensitivity of functions to changes in specific variables.
• They are essential for optimization, physics, economics, and machine learning.
• Techniques like gradient-based optimization rely heavily on partial derivatives.

14.2 Forward Pass: Mathematical Representation


In mathematical terms, the computations for a single node k in layer ℓ are as
follows:

(a) Compute the weighted sum of inputs plus bias:
$$z_k^{(\ell)} = \sum_{j=1}^{m_{\ell-1}} w_{kj}^{(\ell)} a_j^{(\ell-1)} + b_k^{(\ell)}$$
Where:
• $z_k^{(\ell)}$: pre-activation value for node k in layer ℓ.
• $w_{kj}^{(\ell)}$: weight connecting the j-th node in layer ℓ − 1 to node k in layer ℓ.
• $a_j^{(\ell-1)}$: activation (output) of the j-th node in layer ℓ − 1.
• $b_k^{(\ell)}$: bias term for node k in layer ℓ.
• $m_{\ell-1}$: number of nodes in the previous layer (ℓ − 1).

(b) Apply the activation function:
$$a_k^{(\ell)} = f\!\left(z_k^{(\ell)}\right)$$
Where f(·) is the activation function.

The above equations represent the forward pass for a single node in a neural
network. Extend this computation to all nodes in a layer, and subsequently, to all
layers in the network for a full forward pass.
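The per-node equations above vectorize naturally into a layer-by-layer loop; the sketch below assumes sigmoid activations and arbitrary layer sizes and parameter values, purely to illustrate the structure of a full forward pass.

```python
# Full forward pass: z^(l) = W^(l) a^(l-1) + b^(l), a^(l) = f(z^(l)), layer by layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """weights[l] has shape (m_l, m_{l-1}); biases[l] has shape (m_l,)."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b          # weighted sum plus bias for every node in the layer
        a = f(z)               # activations of the layer
    return a                   # activations of the output layer

rng = np.random.default_rng(0)
sizes = [3, 4, 2, 1]           # input, two hidden layers, output (placeholder sizes)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y_hat = forward(np.array([0.5, -1.2, 3.3]), weights, biases)
print(y_hat)                   # the network's predicted output
```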
