
Minor in AI

Neural Networks

December 30, 2024



1 Predictive Modelling
Predictive modelling is about creating a mathematical function that predicts the output (or target) from some input. Think of it as learning a function

f(input) = output

where the goal is to make f as accurate as possible for unseen data.

Figure 1: Workflow
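To make this concrete, here is a minimal sketch (not part of the original notes) that learns a simple predictive function f from example input-output pairs; the data values and the choice of a straight-line model are assumptions made purely for illustration.

```python
# A minimal sketch of predictive modelling: learn a function f that maps an
# input to an output from example pairs. Data values are invented.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # inputs
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # noisy targets, roughly y = 2x

# Fit a straight line f(x) = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)

def f(x_new):
    """The learned predictive function."""
    return a * x_new + b

print(f(6.0))   # prediction for an unseen input; should be close to 12
```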

2 Machine Learning (ML)


Machine Learning is a subset of artificial intelligence (AI) that teaches computers to
learn patterns from data without being explicitly programmed.

2.1 How Does It Work?


1. Given labeled data (e.g., images of animals with the correct labels "dog" or "cat"), ML algorithms learn to map inputs to outputs using mathematical models.
2. Common algorithms: Decision Trees, Support Vector Machines (SVM), k-Nearest Neighbors (kNN), and Random Forests.

3 Deep Learning (DL)


Deep Learning is a specialized subset of Machine Learning that uses neural networks with multiple layers (hence "deep") to learn patterns directly from raw data. Unlike traditional ML, DL automatically extracts relevant features.

3.1 How Does It Work?


1. Input (e.g., raw images) passes through multiple layers of neurons that process the
data and identify patterns (e.g., recognizing ears, eyes, and tails).


2. Deep Learning is particularly effective when dealing with unstructured data like
images, audio, and text.

Figure 2: AI Sphere

4 Dog and Cat Classification


4.1 Problem Statement
In this problem, the goal is to develop a machine learning or deep learning model capable
of accurately classifying images of animals into one of two categories: dog or cat. Given
an input image of an animal, the model should analyze the visual features in the image
and predict whether the animal is a dog or a cat.

Figure 3: Dog vs Cat


4.2 Challenges
This task is challenging due to:

• Similar Features: Dogs and cats may share similar features such as fur patterns, shapes, or poses, making it difficult for a model to distinguish between them.

• Diverse Conditions: Images can vary widely in terms of lighting, background, angles, and the sizes of the animals.

4.3 Objective
The objective is to build a classification model that takes an image as input and outputs
a prediction (dog or cat) with high accuracy, even when faced with unseen or diverse
images.

4.4 ML Approach
• In ML, we extract features manually (e.g., the shape of ears, tail, or color patterns).

• These features are then used to train a model (e.g., SVM) to classify the animal as
a dog or a cat.

Figure 4: ML Workflow
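A minimal sketch of this ML approach, assuming scikit-learn is available; the handcrafted feature values (ear-shape score, tail length, fur-colour index) and labels below are hypothetical stand-ins for measurements extracted from images.

```python
# A minimal sketch: train an SVM on manually extracted features.
from sklearn.svm import SVC

X = [
    [0.9, 0.3, 0.2],   # pointy ears, short tail  -> cat (hypothetical features)
    [0.8, 0.4, 0.3],   # cat-like
    [0.2, 0.9, 0.7],   # floppy ears, long tail   -> dog
    [0.3, 0.8, 0.6],   # dog-like
]
y = [0, 0, 1, 1]       # 0 = cat, 1 = dog

clf = SVC(kernel="rbf")
clf.fit(X, y)                               # learn from the handcrafted features
print(clf.predict([[0.85, 0.35, 0.25]]))    # expected to print [0], i.e. "cat"
```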

4.5 DL Approach
• A DL model such as a Convolutional Neural Network (CNN) can process raw image pixels, learn to identify unique features (e.g., fur texture or the shape of the face), and classify the image as "dog" or "cat."


Figure 5: DL Workflow
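A minimal sketch of the DL approach, assuming TensorFlow/Keras is installed; the 64x64 RGB input size, layer widths, and training call are illustrative choices rather than values from the notes.

```python
# A small CNN sketch that maps raw pixels to a dog-vs-cat probability.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),           # raw RGB pixels (assumed size)
    layers.Conv2D(16, 3, activation="relu"),   # low-level features: edges, textures
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),   # higher-level parts: ears, eyes, tails
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),     # probability of "dog" vs "cat"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)   # hypothetical labelled image arrays
```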

5 Features and Their Importance


5.1 What are Features?
Features are individual measurable properties or characteristics of the data used for
prediction.
In the dog vs. cat problem, features could be the length of ears, shape of the nose, or
pixel intensity values.

5.2 Why are Features Important?


1. Good features make it easier for a model to distinguish between classes.

2. In ML, feature engineering is critical: manually selecting, extracting, or transforming features from the data.

5.3 Feature Learning


Feature learning is a critical component of this classification problem, as it involves identifying and extracting patterns or characteristics from the images that are most relevant for distinguishing between dogs and cats.

5.3.1 Key Aspects


• Manual Feature Extraction: Traditional machine learning models, such as support vector machines (SVMs), often rely on handcrafted features, such as edge detection or color histograms. These features are manually designed based on domain expertise.

• Automatic Feature Extraction: Deep learning models, especially convolutional neural networks (CNNs), automatically learn hierarchical features from the raw data during training.


– Lower layers learn simple features, such as edges and textures.


– Higher layers learn more abstract features, such as shapes, eyes, ears, or tails,
which are specific to dogs or cats.

• Feature Importance: Not all features contribute equally to the classification. For
example:

– The presence of pointy ears might be more relevant for identifying cats.
– A broader face or specific fur patterns might indicate a dog.

Figure 6: Feature Learning

Benefits
• Reduced Dependence on Domain Knowledge: Unlike traditional models, CNNs can automatically learn relevant features, reducing the need for manual intervention.

• Scalability: Feature learning in deep learning is highly scalable to large and diverse datasets.

• Improved Accuracy: Automatically learned features are often more effective in capturing subtle patterns, leading to higher classification accuracy.

6 Modalities and Their Integration

Figure 7: Modalities


6.1 What are Modalities?


• Modalities refer to different types of data sources. For example:

– Images (e.g., dog and cat photos).


– Text (e.g., a description of the animal).
– Audio (e.g., the sound made by the animal).

6.2 Integration of Modalities


• Combining different modalities can improve prediction.

• Example: For classifying animals, combining image data with text descriptions
could provide additional context to improve accuracy.

7 ML vs. DL

Figure 8: ML vs DL

7.1 Key Differences


• Feature Extraction:

– ML requires manual feature engineering.


– DL automatically extracts features.

• Data Dependency:

– ML works well with small to medium-sized datasets.


– DL needs large datasets to perform effectively.

• Complexity:

– ML algorithms are simpler and faster to train.


– DL models are computationally intensive and require GPUs.


7.2 Key Takeaway


1. Use ML for simpler problems or when data is limited.

2. Use DL for complex problems, especially with large datasets and unstructured data
like images or audio.


8 Architecture of Neural Network

Figure 9: Deep Neural Network Architecture

Acyclic, feed-forward network of perceptrons with non-linear activation functions
• A neural network is made up of layers:
– The input layer receives the data.
– Hidden layers process the data by learning patterns.
– The output layer gives the final result.
• Each layer is connected to the next one using lines (called connections), which pass
information forward. Note that this is an Acyclic Network.
• Each connection is adjusted during learning to make the predictions better.
• The goal of training the network is to make accurate predictions for new data.
• After training, we check how well the network works using a separate test dataset.
Objective: Given a training set S, we train a neural network to achieve the least generalization error. We estimate the generalization error using a test set T.

Predicting Exam Scores


We want to predict how well students perform in an exam based on:

• Hours of Sleep: The number of hours a student sleeps before the exam.

• Hours of Study: The number of hours a student studies for the exam.

The goal is to estimate the exam score for students using these two inputs with the
help of a neural network.

The neural network in the diagram consists of:


Figure 10: NN for Exam Score Prediction

• Input Layer: Two inputs (hours of sleep and hours of study).


• Hidden Layer: Two neurons that process the inputs using weights, biases, and an
activation function.
• Output Layer: A single neuron that gives the final prediction (estimated exam
score).

8.1 Notations
• The network consists of L layers labeled as {1, 2, . . . , L}.

• Input Layer: The inputs are connected to the first hidden layer, denoted as
layer 1.

• Output Layer: The final layer, denoted as layer L, is the output layer.

• Each layer ℓ has $m_\ell$ nodes, labeled as $\{1, 2, \ldots, m_\ell\}$.

• Node k in layer ℓ has:

– a bias term $b_k^{(\ell)}$,
– a net (pre-activation) value $z_k^{(\ell)}$, and
– an activation (output) value $a_k^{(\ell)}$.

• The weight on the edge from node j in layer ℓ − 1 to node k in layer ℓ is denoted as $w_{kj}^{(\ell)}$.


8.2 Forward Pass: Numerical Computation in Simple Terms

In a neural network, each node in a layer performs two main steps during a forward pass:

1. Combine Inputs: Multiply each input by a weight, add them together, and include a bias term. Think of this as finding the "weighted importance" of all the inputs, plus a baseline adjustment.

2. Apply Activation Function: Take the combined value from step 1 and
pass it through an activation function (e.g., squashing it between 0 and 1 for
a sigmoid function). This gives the node’s output.

Example: Suppose we have three inputs to a node with values $a_1 = 2$, $a_2 = 3$, $a_3 = 1$, weights $w_1 = 0.5$, $w_2 = 0.2$, $w_3 = 0.8$, and bias $b = 0.1$.

1. Compute the weighted sum:
$$z = (0.5 \times 2) + (0.2 \times 3) + (0.8 \times 1) + 0.1 = 1 + 0.6 + 0.8 + 0.1 = 2.5$$

2. Apply an activation function: Using a sigmoid activation function, the output becomes:
$$a = \frac{1}{1 + e^{-2.5}} \approx 0.92$$
The final output of the node is 0.92.
This process is repeated for every node in every layer during the forward pass.
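The same two steps can be checked in a few lines of Python (a sketch using NumPy, reproducing the numbers above):

```python
# Verify the worked example: weighted sum plus bias, then a sigmoid activation.
import numpy as np

a_in = np.array([2.0, 3.0, 1.0])       # inputs a1, a2, a3
w = np.array([0.5, 0.2, 0.8])          # weights w1, w2, w3
b = 0.1                                # bias

z = np.dot(w, a_in) + b                # weighted sum: 2.5
a_out = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation: ~0.92

print(z, a_out)
```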

Step-by-Step Workflow
1. Inputs: The inputs to the network are:
$$X = \begin{bmatrix} 3 & 5 \\ 5 & 1 \\ 10 & 2 \end{bmatrix}$$
Each row represents the hours of sleep and hours of study for a student.

2. Hidden Layer Calculations: In the hidden layer, each neuron processes inputs from the previous layer. To calculate the input to a neuron, each input value is multiplied by a corresponding weight, summed together, and then a bias term is added. This produces the pre-activation value:
$$z_j^{(2)} = \sum_{i=1}^{2} w_{ij}^{(1)} x_i + b_j^{(1)}$$
After calculating the pre-activation value, a non-linear activation function f(·) is applied to this value to produce the neuron's output:
$$a_j^{(2)} = f\!\left(z_j^{(2)}\right)$$


3. Output Layer Calculations: In the output layer, the outputs from the hidden layer are processed similarly. Each output from the hidden layer is multiplied by a new set of weights, summed together, and a bias term is added. This produces the pre-activation value for the output layer:
$$z^{(3)} = \sum_{j=1}^{2} w_j^{(2)} a_j^{(2)} + b^{(2)}$$
The pre-activation value is then passed through an activation function f(·) to produce the final prediction:
$$\hat{y} = f\!\left(z^{(3)}\right)$$

4. Output: The final value, ŷ, represents the predicted exam score, which is the
output of the network.
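The whole workflow can be sketched in NumPy as below. Since the notes do not give trained parameter values, the weights here are random placeholders and the sigmoid output is a normalized score in (0, 1); the snippet is only meant to show the shape of the computation.

```python
# Forward pass for the exam-score network: 2 inputs, 2 hidden neurons, 1 output.
import numpy as np

X = np.array([[3.0, 5.0],
              [5.0, 1.0],
              [10.0, 2.0]])            # rows: (hours of sleep, hours of study)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))           # weights: input -> hidden (placeholders)
b1 = np.zeros(2)                       # hidden-layer biases
W2 = rng.normal(size=(2, 1))           # weights: hidden -> output
b2 = np.zeros(1)                       # output bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Z2 = X @ W1 + b1                       # pre-activations of the hidden layer
A2 = sigmoid(Z2)                       # hidden-layer activations
Z3 = A2 @ W2 + b2                      # pre-activation of the output neuron
y_hat = sigmoid(Z3)                    # predicted (normalized) exam scores

print(y_hat.ravel())                   # one prediction per student
```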

9 Learning of Parameters

Figure 11: Forward Pass and Backpropagation

9.1 Objective
The primary goal is to find the optimal values for all the weights $w_{kj}^{(\ell)}$ and biases $b_j^{(\ell)}$ in a neural network. This is achieved using the gradient descent algorithm, which minimizes a defined cost function, denoted as C. The process involves calculating partial derivatives of C with respect to the weights and biases:
$$\frac{\partial C}{\partial w_{kj}^{(\ell)}} \quad \text{and} \quad \frac{\partial C}{\partial b_j^{(\ell)}}.$$


9.2 Assumptions about the Cost Function C(x)


1. For a single input x, the cost function C(x) depends only on the activation value $a_j^{(L)}$ in the output layer L. Specifically, for a training input $(x_i, y_i)$, the squared error is the squared difference between the true value $y_i$ and the predicted output $a_j^{(L)}$:
$$C(x_i) = \left(y_i - a_j^{(L)}\right)^2,$$
where $x_i$ and $y_i$ are fixed values (input and target output), and $a_j^{(L)}$ is the output activation of the network's output layer (a variable).

2. The total cost C is the average cost across all training samples. It is calculated by summing the individual costs $C(x_i)$ for each training example and dividing by the total number of examples n:
$$C = \frac{1}{n} \sum_{i=1}^{n} C(x_i),$$
where n is the total number of training samples. For example, in the case of the mean sum-squared error, the total cost is:
$$C = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - a_j^{(L)}\right)^2.$$

9.3 Perturbation Analysis and Backpropagation


To optimize the weights in the network, we need to understand how changes in the input to each activation function affect the overall cost C. This is done by computing the derivative of the cost with respect to the pre-activation value $z_j^{(\ell)}$ at each node j in layer ℓ:
$$\frac{\partial C}{\partial z_j^{(\ell)}}.$$
Using these gradients, we can then calculate the gradients with respect to the weights and biases:
$$\frac{\partial C}{\partial w_{kj}^{(\ell)}} \quad \text{and} \quad \frac{\partial C}{\partial b_j^{(\ell)}}.$$

9.4 Forward Pass


The forward pass refers to the process of computing the outputs of all layers of the neural network, starting from the input layer and proceeding through each subsequent layer until the output layer is reached. In this process, each layer takes the output from the previous layer, applies weights, biases, and an activation function, and produces the output for the next layer.

Every hidden and output node has two values: a net value (z) and an output value (a).


Figure 12: Perturbation Analysis

(a) Start with the input vector $x = [x_1, x_2, x_3, \ldots]$, which contains the values of the input features.

(b) For the first hidden layer (Layer 1), calculate the weighted sum of the inputs, adding a bias term. This is done for each node k in Layer 1:
$$z_k^{(1)} = \sum_j w_{kj}^{(1)} x_j + b_k^{(1)},$$
where:
• $z_k^{(1)}$ is the weighted sum for node k in Layer 1 (the pre-activation value),
• $w_{kj}^{(1)}$ is the weight connecting input $x_j$ to node k in Layer 1,
• $b_k^{(1)}$ is the bias term for node k in Layer 1.

(c) Apply an activation function f(·) (e.g., sigmoid, ReLU) to the pre-activation value $z_k^{(1)}$ to compute the activation (output) of the node:
$$a_k^{(1)} = f\!\left(z_k^{(1)}\right),$$
where $a_k^{(1)}$ is the output (activation) of node k in Layer 1.

(d) Repeat this process for each subsequent layer, passing the activations from the current layer as inputs to the next layer. The process continues until the output layer (Layer L) is reached. For the output layer, the predicted value $\hat{y}$ is the output of the last node k in Layer L:
$$\hat{y} = a_k^{(L)},$$
where $\hat{y}$ is the predicted output of the network.

9.5 Backpropagation
The process of backpropagation involves the calculation of gradients for updating the
weights in a neural network. The diagram in Figure 14 illustrates the computation
of gradients step-by-step:

Local Gradients: $\dfrac{\partial z}{\partial x}, \ \dfrac{\partial z}{\partial y}$


Figure 13: Backward Pass w.r.t each parameter

Propagation of Gradients:

– The gradient with respect to x is given by:
$$\frac{\partial C}{\partial x} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial x}$$

– Similarly, the gradient with respect to y is given by:
$$\frac{\partial C}{\partial y} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial y}$$

Figure 14: How does gradient propagate?

• $\dfrac{\partial C}{\partial z}$: the gradient of the cost function C with respect to the output z of the function f.
• $\dfrac{\partial z}{\partial x}$: the local gradient, representing how z changes with x.
• $\dfrac{\partial z}{\partial y}$: the local gradient, representing how z changes with y.

These relationships showcase how the chain rule is applied in backpropagation to compute the gradients of the cost C with respect to each input variable.


Using the chain rule, we can compute the gradients by working backwards from the output layer to the earlier layers. Specifically, the gradient of the cost function C with respect to the pre-activation value $z_j^{(\ell)}$ at node j in layer ℓ is given by the following formula.

First, calculate the contribution of each node k in the next layer (ℓ + 1) to the gradient at node j in the current layer. This is done by multiplying the gradient $\frac{\partial C}{\partial z_k^{(\ell+1)}}$ of the cost with respect to the pre-activation value at node k in layer (ℓ + 1) by the weight $w_{kj}^{(\ell+1)}$ that connects node k in layer (ℓ + 1) to node j in layer ℓ. We then multiply by the derivative of the activation function $f'(z_j^{(\ell)})$ at node j in layer ℓ. The final expression for the gradient is:
$$\frac{\partial C}{\partial z_j^{(\ell)}} = \sum_k \frac{\partial C}{\partial z_k^{(\ell+1)}} \, w_{kj}^{(\ell+1)} \, f'\!\left(z_j^{(\ell)}\right),$$
where $f'(z_j^{(\ell)})$ is the derivative of the activation function applied at node j in layer ℓ.

Figure 15: Gradients add at branches!

Weight and Bias Updates: Once we have computed the gradients, we update the weights and biases using gradient descent. The weights $w_{kj}^{(\ell)}$ and biases $b_j^{(\ell)}$ are adjusted in the direction that reduces the cost function C. The updates are performed as follows.

To update the weight $w_{kj}^{(\ell)}$, subtract the product of the learning rate η and the gradient of the cost with respect to the weight:
$$w_{kj}^{(\ell)} \leftarrow w_{kj}^{(\ell)} - \eta \frac{\partial C}{\partial w_{kj}^{(\ell)}}.$$

Similarly, to update the bias $b_j^{(\ell)}$, subtract the product of the learning rate η and the gradient of the cost with respect to the bias:
$$b_j^{(\ell)} \leftarrow b_j^{(\ell)} - \eta \frac{\partial C}{\partial b_j^{(\ell)}},$$


where η is the learning rate, which determines the size of the step taken in the
direction of the negative gradient.
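The sketch below puts the forward pass, the backward pass, and these update rules together for the small exam-score network. The targets, learning rate, and initialization are assumed values for illustration, and the cost is the mean squared error defined earlier.

```python
# One small network trained by repeated gradient-descent steps (illustrative sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)      # input -> hidden
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)      # hidden -> output

X = np.array([[3.0, 5.0], [5.0, 1.0], [10.0, 2.0]])
X = X / X.max(axis=0)                    # scale inputs to [0, 1] (a modelling choice)
y = np.array([[0.75], [0.82], [0.93]])   # hypothetical normalized exam scores
eta = 0.5                                # learning rate
n = X.shape[0]

for step in range(2000):
    # Forward pass
    Z2 = X @ W1 + b1
    A2 = sigmoid(Z2)
    Z3 = A2 @ W2 + b2
    y_hat = sigmoid(Z3)

    # Backward pass (chain rule) for the mean squared-error cost
    dZ3 = (2.0 / n) * (y_hat - y) * y_hat * (1 - y_hat)   # dC/dz at the output node
    dW2, db2 = A2.T @ dZ3, dZ3.sum(axis=0)
    dZ2 = (dZ3 @ W2.T) * A2 * (1 - A2)                    # gradient routed to hidden layer
    dW1, db1 = X.T @ dZ2, dZ2.sum(axis=0)

    # Gradient-descent updates: w <- w - eta * dC/dw, b <- b - eta * dC/db
    W1, b1 = W1 - eta * dW1, b1 - eta * db1
    W2, b2 = W2 - eta * dW2, b2 - eta * db2

print(np.round(y_hat, 3))   # predictions move toward the targets as the cost falls
```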

9.6 Key Takeaways


By combining forward pass and backpropagation:

• The forward pass computes activations layer by layer, from input to output.
• The backward pass calculates the gradients of the cost function with respect
to the weights and biases, propagating errors backward through the network.
• Finally, weights and biases are updated iteratively to minimize the cost
function and optimize the model.

This process ensures that the neural network learns to make accurate predictions
by reducing the error over time.

10 A Simple Example

Figure 16: Backpropagation : An Example

We aim to calculate the partial derivatives of the function f (x, y, z) with respect
to x, y, and z, using the chain rule.

10.1 Function Definition


The given function is:
$$f(x, y, z) = (x + y)z$$
For example, let:
$$x = -2, \quad y = 5, \quad z = -4$$


10.2 Intermediate Variable


Define q as:
$$q = x + y$$
Thus, we compute:
$$q = -2 + 5 = 3$$
The partial derivatives of q are:
$$\frac{\partial q}{\partial x} = 1, \qquad \frac{\partial q}{\partial y} = 1$$

10.3 Function Simplification


The function f can now be expressed in terms of q and z:
$$f = qz$$
The partial derivatives of f are:
$$\frac{\partial f}{\partial q} = z, \qquad \frac{\partial f}{\partial z} = q$$
For the given values of z and q:
$$\frac{\partial f}{\partial q} = -4, \qquad \frac{\partial f}{\partial z} = 3$$

10.4 Chain Rule


To compute the desired derivatives, we apply the chain rule:
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x}, \qquad \frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial y}$$
Substituting the values:
$$\frac{\partial f}{\partial x} = (-4) \cdot 1 = -4, \qquad \frac{\partial f}{\partial y} = (-4) \cdot 1 = -4$$
Additionally:
$$\frac{\partial f}{\partial z} = q = 3$$
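The same worked example, written as a tiny forward and backward pass in Python:

```python
# Forward and backward pass through the computational graph of f(x, y, z) = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (local gradients combined with the chain rule)
df_dq = z          # df/dq = z = -4
df_dz = q          # df/dz = q = 3
df_dx = df_dq * 1  # df/dx = df/dq * dq/dx = -4
df_dy = df_dq * 1  # df/dy = df/dq * dq/dy = -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```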

11 Another Example
We aim to compute the gradient of the function f(w, x), which is defined as:
$$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$
The weights and inputs are:
$$w_0 = 2.00, \quad x_0 = -1.00, \quad w_1 = -3.00, \quad x_1 = -2.00, \quad w_2 = -3.00$$


Figure 17: Backpropagation : Another Example

11.1 Forward Pass - Computation


(a) Weighted Inputs:
$$w_0 \cdot x_0 = 2.00 \cdot (-1.00) = -2.00$$
$$w_1 \cdot x_1 = -3.00 \cdot (-2.00) = 6.00$$

(b) Sum of Inputs:
$$-2.00 + 6.00 + (-3.00) = 4.00 + (-3.00) = 1.00$$

(c) Exponential and Denominator:
$$\text{Exponential term: } e^{-1.00} \approx 0.37$$
$$\text{Denominator: } 1 + e^{-1.00} \approx 1 + 0.37 = 1.37$$

(d) Final Output:
$$f(w, x) = \frac{1}{1.37} \approx 0.73$$

Derivative Rules
Key derivative rules used in this computation are:
$$f(x) = e^x \;\rightarrow\; \frac{df}{dx} = e^x,$$
$$f(x) = \frac{1}{x} \;\rightarrow\; \frac{df}{dx} = -\frac{1}{x^2},$$
$$f_a(x) = ax \;\rightarrow\; \frac{df}{dx} = a,$$
$$f_c(x) = c + x \;\rightarrow\; \frac{df}{dx} = 1.$$


11.2 Backpropagation - Computation


We aim to compute:
$$\frac{\partial f}{\partial w_0}, \; \frac{\partial f}{\partial w_1}, \; \frac{\partial f}{\partial w_2}, \; \frac{\partial f}{\partial x_0}, \; \frac{\partial f}{\partial x_1}.$$

Step 1: Derivative at the Output

The output of the sigmoid function is:
$$f = \frac{1}{1 + e^{-z}}, \qquad z = w_0 x_0 + w_1 x_1 + w_2.$$
The derivative of f with respect to z is:
$$\frac{\partial f}{\partial z} = f(1 - f) = 0.73 \cdot (1 - 0.73) = 0.73 \cdot 0.27 = 0.1971.$$

Step 2: Gradients with Respect to $w_0$, $w_1$, $w_2$

The gradient flows backward through $z = w_0 x_0 + w_1 x_1 + w_2$. The partial derivatives are:
$$\frac{\partial z}{\partial w_0} = x_0, \qquad \frac{\partial z}{\partial w_1} = x_1, \qquad \frac{\partial z}{\partial w_2} = 1.$$
Using the chain rule:
$$\frac{\partial f}{\partial w_0} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial w_0} = 0.1971 \cdot (-1.00) = -0.1971,$$
$$\frac{\partial f}{\partial w_1} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial w_1} = 0.1971 \cdot (-2.00) = -0.3942,$$
$$\frac{\partial f}{\partial w_2} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial w_2} = 0.1971 \cdot 1 = 0.1971.$$

Step 3: Gradients with Respect to $x_0$, $x_1$

The gradients flow through the weighted connections to $x_0$ and $x_1$. The partial derivatives are:
$$\frac{\partial z}{\partial x_0} = w_0, \qquad \frac{\partial z}{\partial x_1} = w_1.$$
Using the chain rule:
$$\frac{\partial f}{\partial x_0} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial x_0} = 0.1971 \cdot 2.00 = 0.3942,$$
$$\frac{\partial f}{\partial x_1} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial x_1} = 0.1971 \cdot (-3.00) = -0.5913.$$


Final Results

The gradients are:
$$\frac{\partial f}{\partial w_0} = -0.1971, \qquad \frac{\partial f}{\partial w_1} = -0.3942, \qquad \frac{\partial f}{\partial w_2} = 0.1971,$$
$$\frac{\partial f}{\partial x_0} = 0.3942, \qquad \frac{\partial f}{\partial x_1} = -0.5913.$$
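A short Python sketch reproducing these numbers (note that the notes round f to 0.73, which gives 0.1971; using the unrounded f = 0.7311 gives 0.1966):

```python
# Gradients of the sigmoid neuron f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))).
import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass
z = w0 * x0 + w1 * x1 + w2          # 1.0
f = 1.0 / (1.0 + math.exp(-z))      # ~0.73

# Backward pass
df_dz = f * (1 - f)                 # ~0.1966 (0.1971 when f is rounded to 0.73)
grads = {
    "w0": df_dz * x0,   # ~ -0.20
    "w1": df_dz * x1,   # ~ -0.39
    "w2": df_dz * 1.0,  # ~  0.20
    "x0": df_dz * w0,   # ~  0.39
    "x1": df_dz * w1,   # ~ -0.59
}
print(round(f, 2), {k: round(v, 4) for k, v in grads.items()})
```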

Figure 18: Solution

Figure 19: Sigmoid Function

Sigmoid Function
The sigmoid function is a widely used activation function in machine learning, particularly in neural networks. Mathematically, it is defined as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
This function maps any real-valued number to the range (0, 1), which is useful for representing probabilities or normalized outputs.


Properties of the Sigmoid Function


(a) Range: The output of σ(x) is always between 0 and 1, i.e.,

σ(x) ∈ (0, 1) ∀x ∈ R.

(b) Asymptotic Behavior: As x → ∞, σ(x) → 1. Similarly, as x → −∞, σ(x) → 0.

(c) Interpretation: The sigmoid function is often used to model probabilities because it outputs values in the range (0, 1). Larger x corresponds to a higher probability, and smaller x corresponds to a lower probability.

Derivative
The derivative of the sigmoid function is essential for optimization algorithms like gradient descent. Let us compute it:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Let $y = \sigma(x)$. Then:
$$y = \frac{1}{1 + e^{-x}}$$
Taking the derivative with respect to x, we apply the chain rule:
$$\frac{d\sigma(x)}{dx} = \frac{d}{dx}\left(\frac{1}{1 + e^{-x}}\right)$$
First, rewrite σ(x) as:
$$\sigma(x) = (1 + e^{-x})^{-1}$$
Using the chain rule:
$$\frac{d\sigma(x)}{dx} = -1 \cdot (1 + e^{-x})^{-2} \cdot \frac{d}{dx}(1 + e^{-x})$$
The derivative of $1 + e^{-x}$ is:
$$\frac{d}{dx}(1 + e^{-x}) = -e^{-x}$$
Substituting this back:
$$\frac{d\sigma(x)}{dx} = -1 \cdot (1 + e^{-x})^{-2} \cdot (-e^{-x})$$
Simplify:
$$\frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}$$
Now, recall that:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
So:
$$1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$$
Substituting σ(x) and 1 − σ(x) back into the derivative:
$$\frac{d\sigma(x)}{dx} = \sigma(x) \cdot (1 - \sigma(x))$$
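A quick numerical check of this identity (a sketch comparing the analytic derivative with a central finite difference):

```python
# Check sigma'(x) = sigma(x) * (1 - sigma(x)) against a finite-difference estimate.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
h = 1e-6
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2 * h)   # central difference

print(sigmoid_grad(xs))   # analytic: maximum 0.25 at x = 0, near 0 at |x| = 5
print(numeric)            # matches the analytic values closely
```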

Interpretation
• The derivative of the sigmoid function reaches its maximum value when
σ(x) = 0.5, which occurs at x = 0.

• For very large or very small values of x, the derivative approaches 0, indicating that the sigmoid function saturates and gradient updates become small. This is known as the vanishing gradient problem.

Key Takeaway
The sigmoid function is smooth and differentiable, making it useful for backpropagation in neural networks. However, its derivative shows that it can cause slow convergence for extreme values of x due to small gradients.

12 Patterns in Backward Flow of Gradients


When performing backpropagation in a neural network, gradients flow backward
through the network. Each type of operation (or gate) has a specific way of
affecting the flow of gradients. Below are three common types of gates and their
roles in gradient propagation:

Figure 20: Backward flow patterns


12.1 Add Gate: Gradient Distributor


• The add gate combines two or more inputs by summing them:
$$z = x_1 + x_2.$$
• In backpropagation, the gradient of the cost function C with respect to z is distributed equally to the inputs:
$$\frac{\partial C}{\partial x_1} = \frac{\partial C}{\partial z}, \qquad \frac{\partial C}{\partial x_2} = \frac{\partial C}{\partial z}.$$
• Role: The add gate acts as a gradient distributor, splitting the gradient and sending it to all inputs.

12.2 Max Gate: Gradient Router


• The max gate selects the maximum value from two inputs:
$$z = \max(x_1, x_2).$$
• In backpropagation, the gradient of the cost function C with respect to z is routed only to the input that contributed the maximum value:
$$\frac{\partial C}{\partial x_1} = \begin{cases} \frac{\partial C}{\partial z}, & \text{if } x_1 > x_2, \\ 0, & \text{otherwise}, \end{cases} \qquad \frac{\partial C}{\partial x_2} = \begin{cases} \frac{\partial C}{\partial z}, & \text{if } x_2 > x_1, \\ 0, & \text{otherwise}. \end{cases}$$
• Role: The max gate acts as a gradient router, directing the gradient to the input with the highest value.

12.3 Multiply (Mul) Gate: Gradient Switcher


• The multiply gate multiplies two inputs:
$$z = x_1 \cdot x_2.$$
• In backpropagation, the gradient of the cost function C with respect to each input is scaled by the value of the other input:
$$\frac{\partial C}{\partial x_1} = \frac{\partial C}{\partial z} \cdot x_2, \qquad \frac{\partial C}{\partial x_2} = \frac{\partial C}{\partial z} \cdot x_1.$$
• Role: The multiply gate acts as a gradient switcher, scaling the gradient for one input by the value of the other input.

Summary of Patterns
1. Add Gate: Distributes gradients equally to all inputs.
2. Max Gate: Routes gradients only to the input that contributed the maximum
value.
3. Mul Gate: Switches or scales gradients between inputs based on their values.

These patterns help to understand how gradients flow through different parts of a
neural network, enabling efficient backpropagation.
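A small sketch of the three patterns, with the upstream gradient ∂C/∂z set to 1 so the local behaviour is easy to read off:

```python
# Gradient behaviour of add, max, and multiply gates for two example inputs.
x1, x2 = 3.0, 5.0
dC_dz = 1.0

# Add gate: z = x1 + x2 -> the gradient is distributed unchanged to both inputs
add_grads = (dC_dz * 1.0, dC_dz * 1.0)            # (1.0, 1.0)

# Max gate: z = max(x1, x2) -> the gradient is routed to the larger input only
max_grads = (dC_dz if x1 > x2 else 0.0,
             dC_dz if x2 > x1 else 0.0)           # (0.0, 1.0)

# Mul gate: z = x1 * x2 -> each input's gradient is scaled by the *other* input
mul_grads = (dC_dz * x2, dC_dz * x1)              # (5.0, 3.0)

print(add_grads, max_grads, mul_grads)
```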


13 Food for Thought

Figure 21: Left for You... Explore & Have Fun!!!


14 Additional Resources
[1] Fei-Fei Li, Lecture 4: Backpropagation and Neural Networks, Stanford University.
[2] Michael Nielsen, Neural Networks and Deep Learning, Chapter 2.
[3] Phillip Isola, Using and Understanding Deep Neural Nets, Tutorial Series, MIT.
[4] M. Shifas PV, Introduction to Neural Networks and Application in Speech Enhancement, Department of CSE, University of Crete.

14.1 Partial Derivatives


What are Partial Derivatives?

• Partial derivatives represent the rate of change of a function with respect to one variable while keeping the other variables constant. For example, if f(x, y) is a function of x and y, $\frac{\partial f}{\partial x}$ measures how f changes as x changes, holding y fixed.

• Visual Representation: Imagine a landscape where height represents the function's value. Moving along the x direction gives $\frac{\partial f}{\partial x}$, and moving along y gives $\frac{\partial f}{\partial y}$.

Figure 22: Geometry of Partial Derivatives

Interpretation and Real-World Context

• For a function f(x, y, z):
– $\frac{\partial f}{\partial x}$ indicates how f changes as x changes, with y and z held constant.
– $\frac{\partial f}{\partial y}$ and $\frac{\partial f}{\partial z}$ follow similarly.

• In economics, if f(x, y) represents profit as a function of production levels x and y, $\frac{\partial f}{\partial x}$ shows the change in profit when increasing x, keeping y constant.

Mathematical Definition

For a multivariable function $f(x_1, x_2, \ldots, x_n)$, the partial derivative with respect to $x_k$ is:
$$\frac{\partial f}{\partial x_k} = \lim_{\Delta x_k \to 0} \frac{f(x_1, \ldots, x_k + \Delta x_k, \ldots, x_n) - f(x_1, \ldots, x_k, \ldots, x_n)}{\Delta x_k}.$$
This measures the instantaneous rate of change of f with respect to $x_k$.

Gradient and Applications

The gradient ∇f of $f(x_1, x_2, \ldots, x_n)$ is:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}.$$
It points in the direction of steepest increase and is crucial in optimization and machine learning.

Examples

(a) Function Example: For $f(x, y) = x^2 + 3xy + y^2$, the partial derivatives are:
$$\frac{\partial f}{\partial x} = 2x + 3y, \qquad \frac{\partial f}{\partial y} = 3x + 2y.$$

(b) Neural Networks: In gradient descent, the partial derivatives $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ are used to update the weights w and biases b:
$$w \leftarrow w - \eta \frac{\partial C}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial C}{\partial b}.$$
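A short sketch verifying example (a) numerically by comparing the analytic partial derivatives with finite differences at an arbitrary point:

```python
# Analytic vs. numerical partial derivatives of f(x, y) = x^2 + 3xy + y^2.
def f(x, y):
    return x**2 + 3 * x * y + y**2

def df_dx(x, y):
    return 2 * x + 3 * y

def df_dy(x, y):
    return 3 * x + 2 * y

x, y, h = 1.0, 2.0, 1e-6
num_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # vary x, hold y fixed
num_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)   # vary y, hold x fixed

print(df_dx(x, y), num_dx)   # 8.0 vs ~8.0
print(df_dy(x, y), num_dy)   # 7.0 vs ~7.0
```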

Geometric Interpretation

• A partial derivative $\frac{\partial f}{\partial x}$ can be visualized as the slope of the tangent line to the curve obtained by fixing all variables except x.
• Similarly, $\frac{\partial f}{\partial y}$ is the slope in the y direction.


Key Takeaways

• Partial derivatives measure the sensitivity of functions to changes in specific variables.
• They are essential for optimization, physics, economics, and machine learning.
• Techniques like gradient-based optimization rely heavily on partial derivatives.

14.2 Forward Pass: Mathematical Representation


In mathematical terms, the computations for a single node k in layer ℓ are as
follows:

(a) Compute the weighted sum of inputs plus bias:
$$z_k^{(\ell)} = \sum_{j=1}^{m_{\ell-1}} w_{kj}^{(\ell)} a_j^{(\ell-1)} + b_k^{(\ell)}$$
Where:
• $z_k^{(\ell)}$: pre-activation value for node k in layer ℓ.
• $w_{kj}^{(\ell)}$: weight connecting the j-th node in layer ℓ − 1 to node k in layer ℓ.
• $a_j^{(\ell-1)}$: activation (output) of the j-th node in layer ℓ − 1.
• $b_k^{(\ell)}$: bias term for node k in layer ℓ.
• $m_{\ell-1}$: number of nodes in the previous layer (ℓ − 1).

(b) Apply the activation function:
$$a_k^{(\ell)} = f\!\left(z_k^{(\ell)}\right)$$
Where f(·) is the activation function.

The above equations represent the forward pass for a single node in a neural
network. Extend this computation to all nodes in a layer, and subsequently, to all
layers in the network for a full forward pass.
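The per-node equations above vectorize naturally into a layer-by-layer loop; the sketch below assumes sigmoid activations and arbitrary layer sizes and parameter values, purely to illustrate the structure of a full forward pass.

```python
# Full forward pass: z^(l) = W^(l) a^(l-1) + b^(l), a^(l) = f(z^(l)), layer by layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """weights[l] has shape (m_l, m_{l-1}); biases[l] has shape (m_l,)."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b          # weighted sum plus bias for every node in the layer
        a = f(z)               # activations of the layer
    return a                   # activations of the output layer

rng = np.random.default_rng(0)
sizes = [3, 4, 2, 1]           # input, two hidden layers, output (placeholder sizes)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y_hat = forward(np.array([0.5, -1.2, 3.3]), weights, biases)
print(y_hat)                   # the network's predicted output
```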
