Unit 2 DL

The document provides an overview of the basics of neural networks, covering key concepts such as operations, layers, building blocks, and training methods. It discusses the forward and backward passes, loss functions like SoftMax Cross Entropy, and techniques for optimizing learning, including momentum and weight initialization. Additionally, it highlights the importance of dropout as a regularization technique to prevent overfitting in deep learning models.

Uploaded by

23adl05

Unit 2: BASICS OF NEURAL NETWORKS

Deep Learning: A First Pass – Neural Networks: Operations, Layers,
Building Blocks, Class – Trainer and Optimizer – Intuition – SoftMax
Cross Entropy Loss Function – Experiments – Momentum – Learning
Rate Decay – Weight Initialization – Dropout.
Deep Learning: A First Pass
• The purpose of a deep learning model is to map inputs, each drawn
from some dataset with common characteristics, to outputs
drawn from a related distribution.
• Define a model with parameters and fit it as follows:
1. Repeatedly feed observations through the model, keeping track of the
quantities computed along the way during this “forward pass.”
2. Calculate a loss representing how far off the model’s predictions were
from the desired outputs, or targets.
3. Using the quantities computed on the forward pass and the chain rule,
compute how much each of the parameters ultimately affects this
loss.
4. Update the values of the parameters so that the loss will hopefully be
reduced when the next set of observations is passed through the model.
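The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's implementation: the toy data, the linear model, the mean squared error loss, and the learning rate of 0.1 are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): y = 3*x0 - 2*x1 + 1 plus small noise.
X = rng.normal(size=(200, 2))
y = X @ np.array([[3.0], [-2.0]]) + 1.0 + 0.01 * rng.normal(size=(200, 1))

# Model parameters.
W = np.zeros((2, 1))
b = np.zeros((1, 1))
lr = 0.1

losses = []
for step in range(200):
    # 1. Forward pass: keep the intermediate quantity `pred`.
    pred = X @ W + b
    # 2. Loss: mean squared error between predictions and targets.
    loss = np.mean((pred - y) ** 2)
    losses.append(loss)
    # 3. Backward pass: the chain rule gives the gradient of the loss
    #    with respect to each parameter.
    dpred = 2.0 * (pred - y) / X.shape[0]
    dW = X.T @ dpred
    db = dpred.sum(axis=0, keepdims=True)
    # 4. Update the parameters in the direction that reduces the loss.
    W -= lr * dW
    b -= lr * db
```

After the loop, W and b should be close to the values used to generate the data, and the recorded losses should decrease.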
Neural Networks: Operations
• Each Operation should have forward and backward methods, each of which
receives an ndarray as input and outputs an ndarray.
• Some Operations, such as matrix multiplication, have another
special kind of input, also an ndarray: the parameters.
• In terms of shape, there are two types of Operations:
• some, such as matrix multiplication, return an ndarray whose shape
differs from that of the ndarray they received as input;
• by contrast, others, such as the sigmoid function, simply apply
some function to each element of the input ndarray.
• Let’s consider the ndarrays passed through our Operations: each
Operation will send outputs forward on the forward pass and will
receive an “output gradient” on the backward pass, which will
represent the partial derivative of the loss with respect to every
element of the Operation’s output (computed by the other
Operations that make up the network).
• Also on the backward pass, each Operation will send an “input
gradient” backward, representing the partial derivative of the loss
with respect to each element of the input.
• Conditions:
• The shape of the output gradient ndarray must match the shape of
the output.
• The shape of the input gradient that the Operation sends backward
during the backward pass must match the shape of the Operation’s
input.
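The two shape conditions can be enforced directly in a base class. The sketch below is one possible layout, assuming hypothetical helper methods named _output and _input_grad that subclasses fill in; the sigmoid subclass is included only to show the checks passing.

```python
import numpy as np


class Operation:
    """Base class (a sketch): subclasses implement _output and _input_grad."""

    def forward(self, input_: np.ndarray) -> np.ndarray:
        # Store the input so the backward pass can use it.
        self.input_ = input_
        self.output = self._output()
        return self.output

    def backward(self, output_grad: np.ndarray) -> np.ndarray:
        # Condition 1: the output gradient must match the output's shape.
        assert output_grad.shape == self.output.shape
        input_grad = self._input_grad(output_grad)
        # Condition 2: the input gradient must match the input's shape.
        assert input_grad.shape == self.input_.shape
        return input_grad

    def _output(self) -> np.ndarray:
        raise NotImplementedError

    def _input_grad(self, output_grad: np.ndarray) -> np.ndarray:
        raise NotImplementedError


class Sigmoid(Operation):
    """Element-wise operation: output has the same shape as the input."""

    def _output(self) -> np.ndarray:
        return 1.0 / (1.0 + np.exp(-self.input_))

    def _input_grad(self, output_grad: np.ndarray) -> np.ndarray:
        # d sigmoid / dx = sigmoid(x) * (1 - sigmoid(x)), then chain rule.
        return self.output * (1.0 - self.output) * output_grad


op = Sigmoid()
out = op.forward(np.zeros((2, 3)))
grad = op.backward(np.ones((2, 3)))
```

Passing a gradient of the wrong shape to backward fails the first assertion, which catches shape bugs at the Operation where they occur.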
Neural Networks: Layers
• Layers are a series of linear operations followed by a nonlinear
operation.
• 5 operations in Perceptron: (Ref Fig below)
• Multiplication of input and weights and Addition of bias
• Sigmoid function (non-linear operation)
• In addition, we say that the input itself represents a special kind of
layer called the input layer.
• The last layer, similarly, is called the output layer.
• The middle layer (the “first one,” according to our numbering) also
has an important name: it is called a hidden layer.
• The output layer is an important exception, in that it does not have to have a
nonlinear operation applied to it.
Connection to the brain
• Each layer can be said to have a certain number of neurons equal to
the dimensionality of the vector that represents each observation in
the layer’s output.
• Ex: input 13 nodes, hidden 13 nodes and output 1 node.
• Neurons in the brain have the property that they can receive inputs
from many other neurons and will “fire” and send a signal forward.
• Neurons in a neural network have a loosely analogous property: they do send
signals forward based on their inputs, but the inputs are transformed
into outputs simply via a nonlinear function (the activation function).
Neural Networks: Building Blocks
• The matrix multiplication of the input with the matrix of parameters
• The addition of a bias term
• The sigmoid activation function
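The three building blocks can be chained on a small batch to see how the shapes flow. The array sizes and the random values below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))   # 4 observations, 3 features
W = rng.normal(size=(3, 2))   # parameter matrix: 3 inputs -> 2 outputs
B = rng.normal(size=(1, 2))   # one bias per output feature


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


M = X @ W          # 1. matrix multiplication with the parameters
Z = M + B          # 2. bias addition (broadcast over the 4 rows)
A = sigmoid(Z)     # 3. element-wise sigmoid activation
```

The matrix multiplication changes the shape from (4, 3) to (4, 2); the bias addition and the sigmoid leave it unchanged.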
Bias
Sigmoid
The Layer Blueprint
• The forward and backward methods simply involve sending the input
successively forward through a series of Operations
• Defining the correct series of Operations in the _setup_layer function and
initializing and storing the parameters in these Operations (which will also
take place in the _setup_layer function)
• Storing the correct values in self.input_ and self.output on the forward
method
• Performing the correct assertion checking in the backward method
• Finally, the _params and _param_grads functions simply extract the
parameters and their gradients (with respect to the loss) from the
ParamOperations within the layer.
The Dense Layer
• A defining characteristic of this layer is that each output neuron is a
function of all of the input neurons.
• That is what the matrix multiplication is really doing: if the matrix has
$n_{in}$ rows and $n_{out}$ columns, the multiplication computes $n_{out}$
new features, each of which is a weighted linear combination of all of
the $n_{in}$ input features.
• It is also called a fully connected layer.
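A short check makes the "every output depends on every input" property concrete. The sizes (13 inputs, 5 outputs) and the chosen output index are hypothetical.

```python
import numpy as np

n_in, n_out = 13, 5                  # hypothetical layer sizes
rng = np.random.default_rng(2)
x = rng.normal(size=(1, n_in))       # one observation
W = rng.normal(size=(n_in, n_out))   # n_in rows by n_out columns

out = x @ W                          # shape (1, n_out)

# Output feature j written out explicitly as a weighted sum of all inputs.
j = 2
explicit = sum(x[0, i] * W[i, j] for i in range(n_in))
```

The value of explicit agrees with out[0, j], so column j of W holds the weights that output neuron j applies to each input neuron.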
Neural Networks: Class
• At a high level, it should be able to learn from data: more precisely, it
should be able to take in batches of data representing “observations”
(X) and “correct answers” (y) and learn the relationship between X
and y, which means learning a function that can transform X into
predictions p that are very close to y.
• How exactly will this learning take place, given the Layer and
Operation classes?
• Forward
• Backward
• Loss
Trainer and Optimizer
• Training
Optimizer
Intuition
• Neural networks contain a bunch of weights; given these weights,
along with some input data X and y, we can compute a resulting
“loss.”
• In reality, each individual weight has some complex, nonlinear
relationship with the features X, the target y, the other weights, and
ultimately the loss L.
• When we start to train neural networks, we initialize each
weight to have a value somewhere along the x-axis.
• Then, using the gradients we calculate during
backpropagation, we iteratively update the weight, with
our first update based on the slope of this curve at the
initial value we happened to choose.
• The goal of training a deep learning model is to move
each weight to the “global” value for which the loss is
minimized.
• If the steps we take are too small, we risk ending up in a
“local” minimum, which is less optimal than the global
one.
• If the steps are too large, we risk “repeatedly hopping
over” the global minimum, even if we are near it.
• The fundamental trade-off of tuning learning rates: if they are too
small, we can get stuck in a local minimum; if they are too large, they
can skip over the global minimum.
• There are thousands, if not millions, of weights in a neural network,
so searching for a global minimum in a space that has
thousands or millions of dimensions is more complicated than this one-dimensional picture suggests.
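The effect of the step size can be seen on a one-dimensional toy loss. The sketch below assumes L(w) = w**2, which is convex and so has no local minima; it only illustrates steps that are too small (slow progress) and too large (hopping over the minimum), not the local-minimum case.

```python
def descend(lr, w0=5.0, steps=20):
    """Gradient descent on the toy loss L(w) = w**2 (gradient 2*w)."""
    w = w0
    path = [w]
    for _ in range(steps):
        w -= lr * 2.0 * w
        path.append(w)
    return path


small = descend(lr=0.01)   # tiny steps: still far from the minimum at 0
good = descend(lr=0.4)     # moderate steps: converges quickly
large = descend(lr=1.1)    # oversized steps: hops over 0 and diverges
```

With lr=1.1 each update multiplies w by -1.2, so the weight changes sign on every step while its magnitude grows.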
SoftMax Cross Entropy Loss Function
Problem type | Loss | No. of nodes in output layer | Activation in output layer | Actual value (y) | Metrics
Regression | MSE, MAE | 1 | Linear | - | RMSE, R2 score
Classification (multi-class) | Categorical cross entropy | No. of categories in target variable | Softmax | One-hot encoded | F1 score, accuracy, recall, precision
Classification (two-class) | Binary cross entropy | 1 | Sigmoid | Label encoded | F1 score, accuracy, recall, precision
• Mean Squared Error (MSE) works well for Regression problems.
• It turns out that in classification problems, we can do better than this,
since in such problems we know that the values our network outputs
should be interpreted as probabilities; thus, not only should each
value be between 0 and 1, but the vector of probabilities should sum
to 1 for each observation we have fed through our network.
• The softmax cross entropy loss function exploits this to produce
steeper gradients than the mean squared error loss for the same
inputs.
The Softmax Function
• For a classification problem with N possible classes, we’ll have our
neural network output a vector of N values for each observation. For
a problem with three classes, these values could, for example, be:
[5, 3, 2]
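• Since we need probability values, one option is to normalize the vector by its sum:
$\text{normalize}(x)_i = x_i / \sum_{j=1}^{N} x_j$, which maps [5, 3, 2] to [0.5, 0.3, 0.2]
• The softmax function instead exponentiates each value before normalizing:
$\text{softmax}(x)_i = e^{x_i} / \sum_{j=1}^{N} e^{x_j}$
• Intuition: the exponential amplifies the largest value relative to the others
• Output for [5, 3, 2]: approximately [0.84, 0.11, 0.04]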
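The softmax of the example vector can be computed directly. This is a minimal sketch; subtracting the maximum before exponentiating is a common numerical-stability step added here as an assumption, and it does not change the result.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)


scores = np.array([5.0, 3.0, 2.0])
normalized = scores / scores.sum()   # plain normalization: [0.5, 0.3, 0.2]
probs = softmax(scores)              # approximately [0.84, 0.11, 0.04]
```

Both results sum to 1, but softmax puts far more weight on the largest score than plain normalization does.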
The Cross Entropy Loss
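• It computes the loss between the actual probabilities (y) and the predicted probabilities (p)
• The cross entropy loss function, for each index i in these vectors, is:
$CE(p_i, y_i) = -y_i \log(p_i) - (1 - y_i) \log(1 - p_i)$
• Intuition: to see why this makes sense as a loss function, consider that since
every element of y is either 0 or 1, the preceding equation reduces to:
$CE(p_i, y_i) = -\log(1 - p_i)$ if $y_i = 0$, and $CE(p_i, y_i) = -\log(p_i)$ if $y_i = 1$
• SoftMax Cross Entropy (SCE): the cross entropy loss applied to the softmax of the network's outputs x, that is, $SCE(x, y) = CE(\text{softmax}(x), y)$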
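A minimal NumPy sketch of the combined loss follows. As an assumption, it uses the standard multi-class form that keeps only the -y_i log(p_i) term: with one-hot targets, that is the form whose gradient with respect to the network outputs is exactly softmax(x) - y. The function names and the small eps constant are illustrative.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = x - np.max(x, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)


def softmax_cross_entropy(logits: np.ndarray, y: np.ndarray, eps: float = 1e-12):
    """Return (loss, gradient with respect to logits) for one-hot targets y."""
    p = softmax(logits)
    # Only the true class contributes, because y is one-hot.
    loss = -np.sum(y * np.log(p + eps))
    # Gradient of the loss with respect to the logits.
    grad = p - y
    return loss, grad


logits = np.array([[5.0, 3.0, 2.0]])
y = np.array([[1.0, 0.0, 0.0]])
loss, grad = softmax_cross_entropy(logits, y)
```

The gradient entries sum to zero for each observation: raising the score of the true class lowers the loss, while raising any other score increases it.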
Note on Activation Functions
• The sigmoid activation used so far has two useful properties:
• It is a nonlinear and monotonic function
• It provides a “regularizing” effect on the model, forcing the
intermediate features into a finite range, specifically between 0
and 1
• Its drawback is the size of its gradient. The gradient that gets passed to the sigmoid function (or any
function) on the backward pass represents how much the function’s
output ultimately affects the loss; because the maximum slope of the
sigmoid function is 0.25, these gradients will at best be divided by 4
when sent backward to the previous operation in the model.
• Worse still, when the input to the sigmoid function is less than -2 or
greater than 2, the gradient those inputs receive will be small (about 0.1 at most, shrinking toward 0 as the input moves further out),
since sigmoid(x) flattens out beyond x = -2 and x = 2.
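• ReLU: output from 0 to x; gradient 0 for negative inputs and 1 for positive inputs (0.5 on average if inputs are equally likely to be either)
• Tanh: output from -1 to 1; maximum gradient 1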
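The gradient sizes quoted above can be checked numerically. The sketch below evaluates the derivatives of sigmoid, tanh, and ReLU; treating tanh and ReLU as the comparison activations is an assumption, as is the evaluation grid.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2


def relu_grad(x):
    return (x > 0).astype(float)


xs = np.linspace(-6.0, 6.0, 1201)           # grid that includes x = 0
max_sigmoid_grad = sigmoid_grad(xs).max()   # 0.25, reached at x = 0
max_tanh_grad = tanh_grad(xs).max()         # 1.0, reached at x = 0
sigmoid_grad_at_2 = sigmoid_grad(2.0)       # about 0.105
```

A gradient passing backward through several sigmoid operations is multiplied by at most 0.25 each time, which is one reason deeper networks often use other activations.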
Experiments
• We’ll use the MNIST dataset, which consists of black and white
images of handwritten digits that are 28 × 28 pixels, with the value of
each pixel ranging from 0 (white) to 255 (black)
• The dataset is predivided into a training set of 60,000 images and a
testing set of 10,000 additional images
Data Preprocessing
• Use one-hot encoding to transform the vectors representing the labels
into an ndarray of the same shape as the predictions
• Scale the data to mean 0 and variance 1

• Since image is used
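The two preprocessing steps can be sketched with NumPy. The arrays below are random stand-ins with MNIST-like shapes, not the real dataset, and scaling with a single global mean and standard deviation (rather than one per pixel) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in data with MNIST-like shapes (random values, not the real dataset).
X_train = rng.integers(0, 256, size=(100, 784)).astype(float)
y_train = rng.integers(0, 10, size=100)

# One-hot encode the labels: shape (n, 10), matching the 10 predictions.
num_classes = 10
y_onehot = np.zeros((y_train.shape[0], num_classes))
y_onehot[np.arange(y_train.shape[0]), y_train] = 1.0

# Global scaling (an assumption): one mean and one standard deviation
# for the whole array rather than one per pixel.
mean, std = X_train.mean(), X_train.std()
X_scaled = (X_train - mean) / std
```

The same mean and std computed on the training set would normally be reused to scale the test set, so that both sets go through the same transformation.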
Model
• Output: 10 labels
• Output layer activation function is sigmoid (maps each output to a value between 0 and 1)
• Hidden layer size: the geometric mean of the number of inputs (784) and the number of outputs (10): 89 ≈
sqrt(784 × 10)
• Two-layer neural network
Momentum
• The standard “update rule” for our weights at each time step is to take the derivative of the loss
with respect to the weights and move the weights in the direction that reduces the
loss, scaled by the learning rate.
update = self.lr*kwargs['grad']
kwargs['param'] -= update
• Intuition for Momentum
• parameter’s value is continually updated in the same direction because the
loss continues to decrease with each iteration
• the value of the update at each time step would be analogous to the
parameter’s “velocity”.
• objects don’t instantaneously stop and change directions; that’s because
they have momentum
Implementation
• the parameter update at each time step will be a weighted average of
the parameter updates at past time steps
Math
• With momentum parameter $\mu$ and gradient $\nabla_t$ at each time step,
our weight update is: $\text{update} = \nabla_t + \mu \nabla_{t-1} + \mu^2 \nabla_{t-2} + \dots$
Code
• Our Optimizer will keep track of a separate quantity representing the
history of parameter updates in addition to just receiving a gradient
at each time step.
• How should we update velocity? It turns out we can use the
following steps:
1. Multiply it by the momentum parameter.
2. Add the gradient.
• This results in the velocity taking on the following values at each time
step, starting at t = 1:
1. $\nabla_1$
2. $\nabla_2 + \mu \nabla_1$
3. $\nabla_3 + \mu (\nabla_2 + \mu \nabla_1) = \nabla_3 + \mu \nabla_2 + \mu^2 \nabla_1$
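The two velocity steps translate into a short update function. This is a sketch under stated assumptions: sgd_momentum_step is a hypothetical helper, not the SGDMomentum class itself, and the learning rate is applied when the parameter is moved rather than stored inside the velocity.

```python
import numpy as np


def sgd_momentum_step(param, grad, velocity, lr=0.1, momentum=0.9):
    """One update; modifies param and velocity in place."""
    velocity *= momentum      # 1. multiply the velocity by the momentum
    velocity += grad          # 2. add the current gradient
    param -= lr * velocity    # move the parameter against the velocity


param = np.array([1.0])
velocity = np.zeros_like(param)
grads = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
history = []
for g in grads:
    sgd_momentum_step(param, g, velocity, lr=0.1, momentum=0.5)
    history.append(velocity.copy())
# Velocities with momentum 0.5: 1.0, then 2.0 + 0.5*1.0 = 2.5,
# then 3.0 + 0.5*2.5 = 4.25
```

The third velocity, 4.25, equals 3.0 + 0.5 × 2.0 + 0.25 × 1.0, which matches the expanded sum of past gradients weighted by powers of the momentum.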
SGD Momentum
• optimizer = SGDMomentum(lr=0.1, momentum=0.9)
Learning Rate Decay
• Optimizing learning rate
• Learning Rate Decay is a technique used to gradually decrease the
learning rate during the training process in deep learning. The idea
behind it is to start with a higher learning rate to make fast progress
early in training, and then reduce the learning rate as training
progresses to fine-tune the model and prevent overshooting the
optimal solution.
• Advantages:
• Fast initial learning
• Avoid overshooting
• Better convergence
Types
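• Linear decay:
$\alpha_t = \alpha_{start} - (\alpha_{start} - \alpha_{end}) \times t / N$
where N is the total number of epochs, $\alpha_{start}$ and $\alpha_{end}$ are the starting and ending learning rates, and
t is the time step
• Exponential decay:
$\alpha_t = \alpha_{start} \times \delta^t$, where $\delta$ is a constant decay factor between 0 and 1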
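Both schedules can be written as small functions of the epoch index. This is a sketch: the function names are hypothetical, and choosing the exponential factor so that the last epoch lands exactly on the ending learning rate is an assumption about how the factor is set.

```python
def linear_decay(lr_start, lr_end, t, n_epochs):
    """Learning rate after t of n_epochs epochs, decreasing linearly."""
    return lr_start - (lr_start - lr_end) * t / n_epochs


def exponential_decay(lr_start, lr_end, t, n_epochs):
    """Multiply by a constant factor each epoch so that epoch
    n_epochs - 1 ends at lr_end (assumed choice of factor)."""
    delta = (lr_end / lr_start) ** (1.0 / (n_epochs - 1))
    return lr_start * delta ** t


linear = [linear_decay(0.1, 0.01, t, 10) for t in range(11)]
exponential = [exponential_decay(0.1, 0.01, t, 10) for t in range(10)]
```

Linear decay removes the same amount each epoch, while exponential decay removes the same fraction, so it drops faster early on and more slowly near the end.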
Weight Initialization
• Weight Initialization is a crucial step in deep learning to ensure that
neural networks train effectively. It involves assigning initial values to
the weights of the network before the training process begins. The
choice of weight initialization can significantly impact the model's
ability to converge, avoid vanishing or exploding gradients, and reach
a good solution faster.
Why is Weight Initialization Important?

Improper initialization of weights can lead to:

• Vanishing or exploding gradients: When gradients become too small
or too large, making training ineffective.
• Slow convergence: Poor initialization can make gradient-based
optimizers converge very slowly.
• Failure to break symmetry: If all weights are initialized to the same value
(like zero), the network will lack the diversity needed to learn useful
features during training.
Common Weight Initialization Techniques
1. Zero Initialization:
• Method: All weights are initialized to zero.
• Problem: This leads to the same gradient for every neuron in a layer,
resulting in identical updates for all neurons, making the network unable to
learn properly.
2. Random Initialization:
• Method: Weights are initialized randomly from a uniform or normal
distribution.
• Problem: If the variance of the random values is too high, it can lead to
exploding gradients; if it's too low, it can lead to vanishing gradients.
3. Xavier (Glorot) Initialization:
• Method: Weights are initialized by sampling from a distribution with
zero mean and a variance that depends on the number of input and
output neurons.
• The goal is to maintain the variance of activations and gradients
throughout the network.
• Formula for Xavier initialization: $\mathrm{Var}(W) = \dfrac{2}{n_{in} + n_{out}}$, where $n_{in}$ and $n_{out}$ are the numbers of input and output neurons of the layer
• Xavier initialization is widely used with sigmoid or hyperbolic tangent
(tanh) activation functions to keep the variance of activations stable
across layers.
4. He Initialization:
• Method: Similar to Xavier initialization but designed for rectified
linear units (ReLU) or variants like Leaky ReLU.
• Formula for He initialization: $\mathrm{Var}(W) = \dfrac{2}{n_{in}}$
• Setup_layer function
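A minimal sketch of the Xavier and He schemes follows, assuming weights drawn from a normal distribution (a uniform distribution with the same variance is another common choice). The function names and the 784-by-89 layer size are illustrative.

```python
import numpy as np


def xavier_init(n_in, n_out, rng):
    """Normal weights with variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))


def he_init(n_in, n_out, rng):
    """Normal weights with variance 2 / n_in."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))


rng = np.random.default_rng(4)
W_xavier = xavier_init(784, 89, rng)
W_he = he_init(784, 89, rng)
```

The sample variance of each matrix should be close to its target value, 2/873 for Xavier and 2/784 for He with these sizes.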
Dropout
• Dropout is a regularization technique used in deep learning to
prevent overfitting. It involves randomly "dropping out" (i.e., setting
to zero) a subset of neurons during each forward and backward pass
of training. This forces the network to become more robust and
prevents it from relying too much on any single neuron.
• Dropout randomly disables a fraction of neurons during training,
which ensures that the network learns to distribute the learning
across all neurons instead of becoming overly reliant on a small
subset. This improves the network's ability to generalize to unseen
data.
How Dropout Works:
• During Training:
• For each batch, a random subset of neurons is "dropped out" by setting
their activations to zero.
• The neurons that are dropped out do not contribute to the forward pass or
the gradients during the backward pass.
• The dropout rate (often denoted as p) specifies the fraction of neurons to
drop. For example, if p=0.5, 50% of the neurons are randomly dropped out
in each iteration.
• During Testing:
• No neurons are dropped out during inference (testing). Instead, the output
of each neuron is scaled by the keep probability 1 - p (equivalently, the weights
are multiplied by 1 - p), so that the expected input to the next layer matches what it
saw during training.
Mathematical Explanation:
• Let’s say $h_l$ is the output of layer l before applying dropout. Dropout
randomly masks out neurons according to the dropout rate p,
resulting in:
• $h_l' = M_l \odot h_l$
• Where:
• $M_l$ is a binary mask whose entries are drawn from a Bernoulli distribution and equal 1 (keep) with
probability 1 - p and 0 (drop) with probability p,
• $\odot$ denotes the element-wise product.
• During testing, the activations are scaled by 1 - p to account for the
dropout applied during training: $h_l^{test} = (1 - p) h_l$
• Dropout Rate:
• The dropout rate p determines the fraction of neurons to drop out.
Typical values of p range from 0.2 to 0.5. For example:
• p=0.5: Commonly used for fully connected layers.
• p=0.2 or p=0.3: Often used for convolutional layers, as convolutional
layers are usually less prone to overfitting.
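The masking and test-time scaling described above can be sketched in NumPy. The function names are hypothetical, and the input of all ones is chosen only so that the averages are easy to read.

```python
import numpy as np


def dropout_train(h, p, rng):
    """Zero each activation with probability p; return output and mask."""
    mask = (rng.random(h.shape) >= p).astype(h.dtype)   # 1 = keep, 0 = drop
    return h * mask, mask


def dropout_test(h, p):
    """No units are dropped; scale by the keep probability 1 - p."""
    return h * (1.0 - p)


rng = np.random.default_rng(5)
h = np.ones((1000, 100))
p = 0.5
h_train, mask = dropout_train(h, p, rng)
h_test = dropout_test(h, p)
```

The mean of h_train is close to the mean of h_test, which is the point of the test-time scaling. Many libraries, including the Keras Dropout layer, use the equivalent "inverted" variant instead: kept activations are scaled up by 1 / (1 - p) during training, and nothing is changed at inference.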
When to Use Dropout:
• Dense layers: Dropout is commonly applied to fully connected layers,
especially in deep networks, to prevent overfitting.
• Convolutional layers: While less common, dropout can also be
applied to convolutional layers but typically with smaller rates (e.g.,
p=0.2).
• Recurrent layers (RNNs, LSTMs): Specialized versions of dropout (like
Variational Dropout) can be applied to recurrent networks.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))  # drop 50% of this layer's outputs during training
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))  # dropout applied again
model.add(Dense(10, activation='softmax'))
