Neural Networks
1 Predictive Modelling
Predictive modelling is about creating a mathematical function that predicts the output (or target) from some input. Think of it as learning a function
$$f(\text{input}) = \text{output}$$
Figure 1: Workflow
2. Deep Learning is particularly effective when dealing with unstructured data like
images, audio, and text.
Figure 2: AI Sphere
4.2 Challenges
This task is challenging due to:
• Similar Features: Dogs and cats may share similar features such as fur patterns,
shapes, or poses, making it difficult for a model to distinguish between them.
4.3 Objective
The objective is to build a classification model that takes an image as input and outputs
a prediction (dog or cat) with high accuracy, even when faced with unseen or diverse
images.
4.4 ML Approach
• In ML, we extract features manually (e.g., the shape of ears, tail, or color patterns).
• These features are then used to train a model (e.g., SVM) to classify the animal as
a dog or a cat.
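As a rough sketch of this pipeline (not taken from the notes), the example below trains scikit-learn's SVC on a few hand-crafted feature vectors; the feature definitions, numeric values, and labels are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical hand-crafted features per image:
# [ear pointiness, tail length (cm), fur-darkness score]
X_train = np.array([
    [0.9, 25.0, 0.3],   # cat-like: pointy ears, short tail
    [0.8, 28.0, 0.5],   # cat-like
    [0.2, 45.0, 0.7],   # dog-like: rounded ears, long tail
    [0.3, 50.0, 0.6],   # dog-like
])
y_train = np.array(["cat", "cat", "dog", "dog"])

# Train a Support Vector Machine on the manually extracted features.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# Classify a new animal from its hand-crafted features.
print(clf.predict(np.array([[0.85, 26.0, 0.4]])))  # expected: "cat"
```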
Figure 4: ML Workflow
4.5 DL Approach
• A DL model like Convolutional Neural Networks (CNNs) can process raw image pixels, learn to identify unique features (e.g., fur texture or shape of the face), and classify the image as "dog" or "cat."
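A minimal sketch of what such a CNN might look like in PyTorch; the layer sizes, the 64×64 input resolution, and the two-class output head are illustrative assumptions rather than an architecture given in the notes.

```python
import torch
import torch.nn as nn

# Minimal CNN sketch: raw pixels in, "dog"/"cat" logits out.
class DogCatCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level patterns (edges, textures)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level patterns (fur, face shape)
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, 2)      # two classes: dog, cat

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

model = DogCatCNN()
dummy_batch = torch.randn(4, 3, 64, 64)   # 4 RGB images, 64x64 pixels
print(model(dummy_batch).shape)           # torch.Size([4, 2])
```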
Figure 5: DL Workflow
• Feature Importance: Not all features contribute equally to the classification. For
example:
– The presence of pointy ears might be more relevant for identifying cats.
– A broader face or specific fur patterns might indicate a dog.
Benefits
• Reduced Dependence on Domain Knowledge: Unlike traditional models, CNNs can automatically learn relevant features, reducing the need for manual intervention.
• Scalability: Feature learning in deep learning is highly scalable to large and diverse
datasets.
Figure 7: Modalities
• Example: For classifying animals, combining image data with text descriptions
could provide additional context to improve accuracy.
7 ML vs. DL
Figure 8: ML vs DL
• Data Dependency: Traditional ML models can perform well on relatively small datasets, whereas DL models typically need large amounts of data to learn useful features.
• Complexity: DL models contain many more parameters and require more computation to train, but they can capture more complex patterns than classical ML models.
2. Use DL for complex problems, especially with large datasets and unstructured data
like images or audio.
• Hours of Sleep: The number of hours a student sleeps before the exam.
• Hours of Study: The number of hours a student studies for the exam.
The goal is to estimate the exam score for students using these two inputs with the
help of a neural network.
8.1 Notations
• The network consists of L layers labeled as $\{1, 2, \ldots, L\}$.
• Input Layer: The inputs are connected to the first hidden layer, denoted as
layer 1.
• Output Layer: The final layer, denoted as layer L, is the output layer.
1. Combine Inputs: Multiply each input by a weight, add them together, and include a bias term. Think of this as finding the "weighted importance" of all the inputs, plus a baseline adjustment.
2. Apply Activation Function: Take the combined value from step 1 and
pass it through an activation function (e.g., squashing it between 0 and 1 for
a sigmoid function). This gives the node’s output.
Step-by-Step Workflow
1. Inputs: The inputs to the network are:
$$X = \begin{bmatrix} 3 & 5 \\ 5 & 1 \\ 10 & 2 \end{bmatrix}$$
Each row represents the hours of sleep and hours of study for a student.
2. Hidden Layer Calculations: In the hidden layer, each neuron processes inputs
from the previous layer. To calculate the input to a neuron, each input value is
multiplied by a corresponding weight, summed together, and then a bias term is
added. This produces the pre-activation value:
$$z_j^{(2)} = \sum_{i=1}^{2} w_{ij}^{(1)} x_i + b_j^{(1)}, \qquad a_j^{(2)} = f\!\left(z_j^{(2)}\right)$$
3. Output Layer Calculations: In the output layer, the outputs from the hidden
layer are processed similarly. Each output from the hidden layer is multiplied by a
new set of weights, summed together, and a bias term is added. This produces the
pre-activation value for the output layer:
$$z^{(3)} = \sum_{j=1}^{2} w_j^{(2)} a_j^{(2)} + b^{(2)}$$
The pre-activation value is then passed through an activation function f(·) to produce the final prediction:
$$\hat{y} = f\!\left(z^{(3)}\right)$$
4. Output: The final value, ŷ, represents the predicted exam score, which is the
output of the network.
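The forward pass above can be written in a few lines of NumPy. This is a sketch only: the weight and bias values, the use of a sigmoid activation in both layers, and the interpretation of the output as a normalized exam score are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs: each row is (hours of sleep, hours of study) for one student.
X = np.array([[3.0, 5.0],
              [5.0, 1.0],
              [10.0, 2.0]])

# Assumed parameters for a 2-input -> 2-hidden -> 1-output network.
W1 = np.array([[0.1, 0.4],   # w_ij^(1): column j feeds hidden neuron j
               [0.2, 0.3]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[0.5],        # w_j^(2): hidden neuron j -> output
               [0.6]])
b2 = np.array([0.0])

# Hidden layer: z^(2) = X W1 + b1, a^(2) = f(z^(2))
Z2 = X @ W1 + b1
A2 = sigmoid(Z2)

# Output layer: z^(3) = a^(2) W2 + b2, y_hat = f(z^(3))
Z3 = A2 @ W2 + b2
y_hat = sigmoid(Z3)

print(y_hat)   # predicted (normalized) exam score, one per student
```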
9 Learning of Parameters
9.1 Objective
The primary goal is to find the optimal values for all the weights $w_{kj}^{(\ell)}$ and biases $b_j^{(\ell)}$ in a neural network. This is achieved using the gradient descent algorithm, which minimizes a defined cost function, denoted as C. The process involves calculating partial derivatives of C with respect to the weights and biases:
$$\frac{\partial C}{\partial w_{kj}^{(\ell)}} \quad \text{and} \quad \frac{\partial C}{\partial b_j^{(\ell)}}.$$
where n is the total number of training samples. For example, in the case of
the mean sum-squared error, the total cost is:
$$C = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - a_i^{(L)}\right)^2.$$
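For reference, a minimal NumPy version of this cost (the variable names are illustrative):

```python
import numpy as np

def mse_cost(y_true, a_L):
    """Mean sum-squared error: C = (1/n) * sum_i (y_i - a_i^(L))^2."""
    return np.mean((y_true - a_L) ** 2)

# Example: true scores vs. network outputs for n = 3 students.
print(mse_cost(np.array([75.0, 82.0, 93.0]), np.array([70.0, 85.0, 90.0])))
```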
Every hidden node and output node has two values: a net value (z) and an output value (a).
(a) Start with the input vector $x = [x_1, x_2, x_3, \ldots]$, which contains the values of the input features.
(b) For the first hidden layer (Layer 1), calculate the weighted sum of the inputs,
adding a bias term. This is done for each node k in Layer 1:
$$z_k^{(1)} = \sum_j w_{kj}^{(1)} x_j + b_k^{(1)},$$
where:
• $z_k^{(1)}$ is the weighted sum for node k in Layer 1 (the pre-activation value),
• $w_{kj}^{(1)}$ is the weight connecting input $x_j$ to node k in Layer 1,
• $b_k^{(1)}$ is the bias term for node k in Layer 1.
(c) Apply an activation function f(·) (e.g., sigmoid, ReLU) to the pre-activation value $z_k^{(1)}$ to compute the activation (output) of the node:
$$a_k^{(1)} = f\!\left(z_k^{(1)}\right).$$
Repeating these two steps layer by layer, the activation of the output layer gives the network's prediction:
$$\hat{y} = a_k^{(L)}.$$
9.5 Backpropagation
The process of backpropagation involves the calculation of gradients for updating the
weights in a neural network. The diagram in Figure 14 illustrates the computation
of gradients step-by-step:
$$\text{Local Gradient:} \quad \frac{\partial z}{\partial x}, \quad \frac{\partial z}{\partial y}$$
Propagation of Gradients:
• $\frac{\partial C}{\partial z}$: The gradient of the cost function C with respect to the output z of the function f.
• $\frac{\partial z}{\partial x}$: The local gradient, representing how z changes with x.
• $\frac{\partial z}{\partial y}$: The local gradient, representing how z changes with y.
Using the chain rule, we can compute the gradients by working backwards from the output layer to the earlier layers. Specifically, the gradient of the cost function C with respect to the pre-activation value $z_j^{(\ell)}$ at node j in layer ℓ is given by the following formula.

First, calculate the contribution of each node k in the next layer (ℓ + 1) to the gradient at node j in the current layer. This is done by multiplying the gradient $\frac{\partial C}{\partial z_k^{(\ell+1)}}$ of the cost with respect to the pre-activation value at node k in layer (ℓ + 1) by the weight $w_{kj}^{(\ell+1)}$ that connects node k in layer (ℓ + 1) to node j in layer ℓ. We then multiply by the derivative of the activation function $f'(z_j^{(\ell)})$ at node j in layer ℓ. The final expression for the gradient is:
$$\frac{\partial C}{\partial z_j^{(\ell)}} = \sum_k \frac{\partial C}{\partial z_k^{(\ell+1)}} \, w_{kj}^{(\ell+1)} \, f'(z_j^{(\ell)}),$$
where $f'(z_j^{(\ell)})$ is the derivative of the activation function applied at node j in layer ℓ.
Weight and Bias Updates: Once we have computed the gradients, we update the weights and biases using gradient descent. The weights $w_{kj}^{(\ell)}$ and biases $b_j^{(\ell)}$ are adjusted in the direction that reduces the cost function C. The updates are performed as follows.

To update the weight $w_{kj}^{(\ell)}$, subtract the product of the learning rate η and the gradient of the cost with respect to the weight:
$$w_{kj}^{(\ell)} \leftarrow w_{kj}^{(\ell)} - \eta \, \frac{\partial C}{\partial w_{kj}^{(\ell)}},$$
Similarly, to update the bias $b_j^{(\ell)}$, subtract the product of the learning rate η and the gradient of the cost with respect to the bias:
$$b_j^{(\ell)} \leftarrow b_j^{(\ell)} - \eta \, \frac{\partial C}{\partial b_j^{(\ell)}},$$
where η is the learning rate, which determines the size of the step taken in the
direction of the negative gradient.
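To make the gradient recursion and these update rules concrete, here is a minimal NumPy sketch for the 2-input, 2-hidden-node, 1-output network used earlier. The sigmoid activation, the mean squared-error cost, the random initial weights, and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):                            # f'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Data: (sleep, study) -> exam score scaled to [0, 1]. Values are illustrative.
X = np.array([[3.0, 5.0], [5.0, 1.0], [10.0, 2.0]])
y = np.array([[0.75], [0.82], [0.93]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)    # layer 1 (hidden), stored as (inputs, hidden)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)    # layer 2 (output)
eta = 0.5                                        # learning rate

for step in range(1000):
    # Forward pass
    Z2 = X @ W1 + b1; A2 = sigmoid(Z2)
    Z3 = A2 @ W2 + b2; y_hat = sigmoid(Z3)

    # Backward pass: dC/dz at the output, then propagate to the hidden layer
    # using dC/dz^(l) = (dC/dz^(l+1) W^T) * f'(z^(l)).
    dZ3 = (2.0 / len(X)) * (y_hat - y) * sigmoid_prime(Z3)
    dZ2 = (dZ3 @ W2.T) * sigmoid_prime(Z2)

    # Gradients w.r.t. weights and biases, then gradient-descent updates.
    W2 -= eta * (A2.T @ dZ3); b2 -= eta * dZ3.sum(axis=0)
    W1 -= eta * (X.T @ dZ2);  b1 -= eta * dZ2.sum(axis=0)

print(np.round(y_hat, 3))   # predictions move toward the targets as the cost decreases
```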
• The forward pass computes activations layer by layer, from input to output.
• The backward pass calculates the gradients of the cost function with respect
to the weights and biases, propagating errors backward through the network.
• Finally, weights and biases are updated iteratively to minimize the cost
function and optimize the model.
This process ensures that the neural network learns to make accurate predictions
by reducing the error over time.
10 A Simple Example
We aim to calculate the partial derivatives of the function f (x, y, z) with respect
to x, y, and z, using the chain rule.
11 Another Example
We aim to compute the gradient of the function f (w, x), which is defined as:
$$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$
The weights and inputs are:
$$w_0 = 2.00, \quad x_0 = -1.00, \quad w_1 = -3.00, \quad x_1 = -2.00, \quad w_2 = -3.00$$
Derivative Rules
Key derivative rules used in this computation are:
$$f(x) = e^x \;\rightarrow\; \frac{df}{dx} = e^x,$$
$$f(x) = \frac{1}{x} \;\rightarrow\; \frac{df}{dx} = -\frac{1}{x^2},$$
$$f_a(x) = ax \;\rightarrow\; \frac{df}{dx} = a,$$
$$f_c(x) = c + x \;\rightarrow\; \frac{df}{dx} = 1.$$
The gradients flow through the weighted connections to $x_0$ and $x_1$. The partial derivatives are:
$$\frac{\partial z}{\partial x_0} = w_0, \qquad \frac{\partial z}{\partial x_1} = w_1.$$
Using the chain rule:
$$\frac{\partial f}{\partial x_0} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial x_0} = 0.1971 \cdot 2.00 = 0.3942,$$
$$\frac{\partial f}{\partial x_1} = \frac{\partial f}{\partial z} \cdot \frac{\partial z}{\partial x_1} = 0.1971 \cdot (-3.00) = -0.5913.$$
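These numbers can be reproduced with a few lines of Python. Note that the notes round σ(1) ≈ 0.73, giving the local gradient 0.1971; exact arithmetic gives ≈ 0.1966, so the third decimal differs slightly.

```python
import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

z = w0 * x0 + w1 * x1 + w2          # z = 1.0
f = 1.0 / (1.0 + math.exp(-z))      # sigmoid(z) ~= 0.7311
df_dz = f * (1.0 - f)               # ~= 0.1966 (the notes round this to 0.1971)

print(df_dz * w0)   # df/dx0 ~=  0.393  (notes: 0.3942)
print(df_dz * w1)   # df/dx1 ~= -0.590  (notes: -0.5913)
print(df_dz * x0)   # df/dw0 ~= -0.197
print(df_dz * x1)   # df/dw1 ~= -0.393
```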
Final Results
Collecting the gradients (using the rounded local gradient 0.1971):
$$\frac{\partial f}{\partial x_0} = 0.3942, \quad \frac{\partial f}{\partial x_1} = -0.5913, \quad \frac{\partial f}{\partial w_0} = 0.1971 \cdot x_0 = -0.1971, \quad \frac{\partial f}{\partial w_1} = 0.1971 \cdot x_1 = -0.3942, \quad \frac{\partial f}{\partial w_2} = 0.1971.$$
Sigmoid Function
The sigmoid function is a widely used activation function in machine learning,
particularly in neural networks. Mathematically, it is defined as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
This function maps any real-valued number to the range (0, 1), which is useful
for representing probabilities or normalized outputs.
σ(x) ∈ (0, 1) ∀x ∈ R.
Derivative
The derivative of the sigmoid function is essential for optimization algorithms
like gradient descent. Let us compute it:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Let $y = \sigma(x)$. Then:
$$y = \frac{1}{1 + e^{-x}}$$
Taking the derivative with respect to x, we apply the chain rule:
$$\frac{d\sigma(x)}{dx} = \frac{d}{dx}\left(\frac{1}{1 + e^{-x}}\right) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}.$$
So:
$$1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$$
Substitute σ(x) and 1 − σ(x) back into the derivative:
$$\frac{d\sigma(x)}{dx} = \sigma(x) \cdot (1 - \sigma(x))$$
Interpretation
• The derivative of the sigmoid function reaches its maximum value when
σ(x) = 0.5, which occurs at x = 0.
Key Takeaway
The sigmoid function is smooth and differentiable, making it useful for backpropagation in neural networks. However, its derivative shows that it can cause slow convergence for extreme values of x due to small gradients.
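A short sketch illustrating both points: the derivative peaks at 0.25 at x = 0 and becomes vanishingly small for large |x|.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # sigma(x) * (1 - sigma(x))

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x = {x:6.1f}  sigma = {sigmoid(x):.4f}  sigma' = {sigmoid_derivative(x):.6f}")
# sigma'(0) = 0.25 is the maximum; at x = +/-10 the gradient is ~4.5e-5,
# which is why extreme activations slow down learning.
```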
Summary of Patterns
1. Add Gate: Distributes gradients equally to all inputs.
2. Max Gate: Routes gradients only to the input that contributed the maximum
value.
3. Mul Gate: Switches or scales gradients between inputs based on their values.
These patterns help to understand how gradients flow through different parts of a
neural network, enabling efficient backpropagation.
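A tiny sketch of these three gradient patterns (the upstream gradient g and the input values are arbitrary):

```python
# Local gradients of the three gates, given an upstream gradient `g`.

def add_gate_backward(x, y, g=1.0):
    # z = x + y: dz/dx = dz/dy = 1, so g is distributed unchanged.
    return g, g

def max_gate_backward(x, y, g=1.0):
    # z = max(x, y): the gradient is routed only to the larger input.
    return (g, 0.0) if x >= y else (0.0, g)

def mul_gate_backward(x, y, g=1.0):
    # z = x * y: each input receives g scaled by the *other* input's value.
    return g * y, g * x

print(add_gate_backward(3.0, -4.0))   # (1.0, 1.0)
print(max_gate_backward(3.0, -4.0))   # (1.0, 0.0) - only x contributed the max
print(mul_gate_backward(3.0, -4.0))   # (-4.0, 3.0) - "switched" and scaled
```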
14 Additional Resources
[1] Fei-Fei Li, Lecture 4: Backpropagation and Neural Networks, Stanford University.
[2] Michael Nielsen, Neural Networks and Deep Learning, Chapter 2, Lambda.
[3] Phillip Isola, Using and Understanding Deep Neural Nets, Tutorial Series, MIT.
[4] M. Shifas PV, Introduction to Neural Networks and Application in Speech Enhancement, Department of CSE, University of Crete.
– $\frac{\partial f}{\partial y}$ and $\frac{\partial f}{\partial z}$ follow similarly.
• In economics, if f(x, y) represents profit as a function of production levels x and y, $\frac{\partial f}{\partial x}$ shows the change in profit when increasing x, keeping y constant.
Mathematical Definition
For a function f(x, y), the partial derivative with respect to x is defined as:
$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}.$$
Examples
(a) For example, for the function $f(x, y) = x^2 + 3xy + y^2$ (checked numerically in the sketch after these examples):
$$\frac{\partial f}{\partial x} = 2x + 3y, \qquad \frac{\partial f}{\partial y} = 3x + 2y.$$
(b) Neural Networks: In gradient descent, the partial derivatives $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ are used to update the weights w and biases b:
$$w \leftarrow w - \eta \frac{\partial C}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial C}{\partial b}.$$
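Assuming the example function behind the partials in (a) is f(x, y) = x² + 3xy + y² (the notes only state the resulting derivatives), a central-difference check confirms them numerically:

```python
def f(x, y):
    return x**2 + 3 * x * y + y**2     # assumed example function

def partial(fn, point, index, h=1e-6):
    """Central-difference estimate of the partial derivative along one variable."""
    p_plus, p_minus = list(point), list(point)
    p_plus[index] += h
    p_minus[index] -= h
    return (fn(*p_plus) - fn(*p_minus)) / (2 * h)

x, y = 1.0, 2.0
print(partial(f, (x, y), 0), 2 * x + 3 * y)   # both ~= 8.0
print(partial(f, (x, y), 1), 3 * x + 2 * y)   # both ~= 7.0
```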
Geometric Interpretation
• A partial derivative $\frac{\partial f}{\partial x}$ can be visualized as the slope of the tangent line to the curve obtained by fixing all variables except x.
• Similarly, $\frac{\partial f}{\partial y}$ is the slope in the y direction.
Key Takeaways
(a) Compute the weighted sum (pre-activation) for node k in layer ℓ:
$$z_k^{(\ell)} = \sum_{j=1}^{m_{\ell-1}} w_{kj}^{(\ell)} a_j^{(\ell-1)} + b_k^{(\ell)},$$
where:
• $z_k^{(\ell)}$: Pre-activation value for node k in layer ℓ.
• $w_{kj}^{(\ell)}$: Weight connecting the j-th node in layer ℓ − 1 to node k in layer ℓ.
• $a_j^{(\ell-1)}$: Activation (output) of the j-th node in layer ℓ − 1.
• $b_k^{(\ell)}$: Bias term for node k in layer ℓ.
• $m_{\ell-1}$: Number of nodes in the previous layer (ℓ − 1).
(b) Apply the activation function:
$$a_k^{(\ell)} = f\!\left(z_k^{(\ell)}\right)$$
The above equations represent the forward pass for a single node in a neural
network. Extend this computation to all nodes in a layer, and subsequently, to all
layers in the network for a full forward pass.
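A compact sketch of that full forward pass, looping over all layers; the layer sizes and random weights below are placeholders, not values from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights, biases, f=sigmoid):
    """Compute a^(l) = f(W^(l) a^(l-1) + b^(l)) for every layer l = 1..L."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b        # z_k^(l) = sum_j w_kj^(l) a_j^(l-1) + b_k^(l)
        a = f(z)             # a_k^(l) = f(z_k^(l))
    return a                 # activation of the output layer = prediction

rng = np.random.default_rng(0)
sizes = [2, 3, 1]            # assumed layer sizes: 2 inputs, 3 hidden, 1 output
# W^(l) has shape (nodes in layer l, nodes in layer l-1), matching w_kj above.
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

print(forward_pass(np.array([3.0, 5.0]), weights, biases))
```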