Unit 2 DL

The document provides an overview of the basics of neural networks, covering key concepts such as operations, layers, building blocks, and training methods. It discusses the forward and backward passes, loss functions like SoftMax Cross Entropy, and techniques for optimizing learning, including momentum and weight initialization. Additionally, it highlights the importance of dropout as a regularization technique to prevent overfitting in deep learning models.

Uploaded by

23adl05
Copyright
© © All Rights Reserved

Unit 2: BASICS OF NEURAL NETWORKS

Deep Learning: A First Pass – Neural Networks: Operations, Layers, Building Blocks, Class – Trainer and Optimizer – Intuition – SoftMax Cross Entropy Loss Function – Experiments – Momentum – Learning Rate Decay – Weight Initialization – Dropout.
Deep Learning: A First Pass
• The purpose of a deep learning model is to map inputs, each drawn
from some dataset with common characteristics, to outputs
drawn from a related distribution.
• Define a model with parameters and fit it by repeating the following steps:
1. Repeatedly feed observations through the model, keeping track of the
quantities computed along the way during this “forward pass.”
2. Calculate a loss representing how far off our model’s predictions were
from the desired outputs or target.
3. Using the quantities computed on the forward pass and the chain rule,
compute how much each of the input parameters ultimately affects this
loss.
4. Update the values of the parameters so that the loss will hopefully be
reduced when the next set of observations is passed through the model.
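The four steps above can be sketched on a toy linear model. This is a minimal illustration, not the course's implementation: the synthetic data, the learning rate of 0.1, and the names `true_w`, `w`, and `losses` are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: y depends linearly on X (not from the original notes).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([[2.0], [-1.0], [0.5]])
y = X @ true_w

w = np.zeros((3, 1))   # model parameters
lr = 0.1               # learning rate (assumed value)
losses = []

for step in range(50):
    # 1. Forward pass: compute predictions, keeping X and preds for later.
    preds = X @ w
    # 2. Loss: mean squared error between predictions and targets.
    loss = np.mean((preds - y) ** 2)
    losses.append(loss)
    # 3. Backward pass (chain rule): dL/dw = X^T (dL/dpreds).
    dloss_dpreds = 2.0 * (preds - y) / X.shape[0]
    dloss_dw = X.T @ dloss_dpreds
    # 4. Update the parameters in the direction that reduces the loss.
    w -= lr * dloss_dw
```

After the loop, `losses` should fall steadily and `w` should be close to `true_w`.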
Neural Networks: Operations
• Each Operation should have forward and backward methods, each of which
receives an ndarray as an input and outputs an ndarray.
• Some operations, such as matrix multiplication, have another
special kind of input, also an ndarray: the parameters.
• There are two types of Operations with respect to shape:
• some, such as the matrix multiplication, return an ndarray as output that has a
different shape than the ndarray they received as input;
• by contrast, some Operations, such as the sigmoid function, simply apply
some function to each element of the input ndarray.
• Let’s consider the ndarrays passed through our Operations: each
Operation will send outputs forward on the forward pass and will
receive an “output gradient” on the backward pass, which will
represent the partial derivative of the loss with respect to every
element of the Operation’s output (computed by the other
Operations that make up the network).
• Also on the backward pass, each Operation will send an “input
gradient” backward, representing the partial derivative of the loss
with respect to each element of the input.
• Conditions:
• The shape of the output gradient ndarray must match the shape of
the output.
• The shape of the input gradient that the Operation sends backward
during the backward pass must match the shape of the Operation’s
input.
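The two shape conditions can be enforced with assertions in a base class. The sketch below is one possible layout, assuming a base class named `Operation` with helper methods `_output` and `_input_grad`; these names are illustrative and may differ from the course code.

```python
import numpy as np

class Operation:
    """Base class: caches input/output and checks gradient shapes."""

    def forward(self, input_: np.ndarray) -> np.ndarray:
        self.input_ = input_
        self.output = self._output()
        return self.output

    def backward(self, output_grad: np.ndarray) -> np.ndarray:
        # Condition 1: the output gradient must match the output's shape.
        assert output_grad.shape == self.output.shape
        self.input_grad = self._input_grad(output_grad)
        # Condition 2: the input gradient must match the input's shape.
        assert self.input_grad.shape == self.input_.shape
        return self.input_grad

    def _output(self) -> np.ndarray:
        raise NotImplementedError

    def _input_grad(self, output_grad: np.ndarray) -> np.ndarray:
        raise NotImplementedError


class Sigmoid(Operation):
    """Element-wise operation: output has the same shape as the input."""

    def _output(self) -> np.ndarray:
        return 1.0 / (1.0 + np.exp(-self.input_))

    def _input_grad(self, output_grad: np.ndarray) -> np.ndarray:
        # d sigmoid / dx = sigmoid(x) * (1 - sigmoid(x))
        return self.output * (1.0 - self.output) * output_grad


op = Sigmoid()
out = op.forward(np.zeros((2, 3)))
grad = op.backward(np.ones((2, 3)))
```

Putting the checks in the base class means every concrete Operation gets them without repeating the assertions.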

Neural Networks: Layers
• Layers are a series of linear operations followed by a nonlinear
operation.
• 5 operations in Perceptron: (Ref Fig below)
• Multiplication of input and weights and Addition of bias
• Sigmoid function (non-linear operation)
• In addition, we say that the input itself represents a special kind of
layer called the input layer.
• The last layer, similarly, is called the output layer.
• The middle layer—the “first one,” according to our numbering—also
has an important name: it is called a hidden layer.
• The output layer is an important exception, in that it does not have to have a
nonlinear operation applied to it.
Connection to the brain
• Each layer can be said to have a certain number of neurons equal to
the dimensionality of the vector that represents each observation in
the layer’s output.
• Ex: input 13 nodes, hidden 13 nodes and output 1 node.
• Neurons in the brain have the property that they can receive inputs
from many other neurons and will “fire” and send a signal forward.
• Neurons in a neural network have a loosely analogous property: they do indeed send
signals forward based on their inputs, but the inputs are transformed
into outputs simply via a nonlinear function (activation function).
Neural Networks: Building Blocks
• The matrix multiplication of the input with the matrix of parameters
• The addition of a bias term
• The sigmoid activation function
Bias
Sigmoid
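The three building blocks can be chained by hand to see both passes. This sketch assumes illustrative shapes (4 observations, 3 inputs, 2 outputs) and uses the scalar L = sum(S) as a stand-in loss; neither comes from the original slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 observations, 3 features (illustrative sizes)
W = rng.normal(size=(3, 2))   # parameter matrix: 3 inputs -> 2 outputs
B = rng.normal(size=(1, 2))   # bias, broadcast across the batch

# Forward pass through the three building blocks.
M = X @ W          # 1. matrix multiplication with the parameters
N = M + B          # 2. addition of the bias term
S = sigmoid(N)     # 3. sigmoid activation

# Backward pass for the scalar L = sum(S), applying the chain rule in reverse.
dL_dS = np.ones_like(S)
dL_dN = dL_dS * S * (1.0 - S)              # through the sigmoid
dL_dB = dL_dN.sum(axis=0, keepdims=True)   # through the bias addition
dL_dW = X.T @ dL_dN                        # through the matrix multiplication
dL_dX = dL_dN @ W.T
```

Each gradient has the same shape as the quantity it belongs to, which is the shape condition every Operation has to satisfy.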
The Layer Blueprint
• The forward and backward methods simply involve sending the input
successively forward through a series of Operations
• Defining the correct series of Operations in the _setup_layer function and
initializing and storing the parameters in these Operations (which will also
take place in the _setup_layer function)
• Storing the correct values in self.input_ and self.output on the forward
method
• Performing the correct assertion checking in the backward method
• Finally, the _params and _param_grads functions simply extract the
parameters and their gradients (with respect to the loss) from the
ParamOperations within the layer.
The Dense Layer
• A defining characteristic of this layer is that each output neuron is a
function of all of the input neurons.
• That is what the matrix multiplication is really doing: if the matrix is
nin rows by nout columns, the multiplication itself is computing nout
new features, each of which is a weighted linear combination of all of
the nin input features.
• It is also called a fully connected layer.
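A small numeric check makes the "weighted linear combination" statement concrete. The values of `x` and `W` below are made up for illustration, with nin = 3 and nout = 2.

```python
import numpy as np

n_in, n_out = 3, 2
x = np.array([[1.0, 2.0, 3.0]])            # one observation with n_in features
W = np.array([[0.1, 1.0],
              [0.2, 0.0],
              [0.3, -1.0]])                # shape (n_in, n_out)

out = x @ W                                # shape (1, n_out)

# Each output feature is a weighted sum of all n_in input features.
manual = np.array([[sum(x[0, i] * W[i, j] for i in range(n_in))
                    for j in range(n_out)]])
```

Here `out` and `manual` agree: column j of `W` holds the weights that output neuron j applies to every input.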
Neural Networks: Class
• At a high level, it should be able to learn from data: more precisely, it
should be able to take in batches of data representing “observations”
(X) and “correct answers” (y) and learn the relationship between X
and y, which means learning a function that can transform X into
predictions p that are very close to y.
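• How exactly will this learning take place, given the Layer and
Operation classes?
• Forward: pass X successively forward through each Layer to produce the predictions.
• Backward: send the loss gradient backward through each Layer, computing the parameter gradients along the way.
• Loss: compare the predictions with y to compute the loss and the loss gradient.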
Trainer and Optimizer
• Training
Optimizer
Intuition
• Neural networks contain a bunch of weights; given these weights,
along with some input data X and y, we can compute a resulting
“loss.”

• In reality, each individual weight has some complex, nonlinear
relationship with the features X, the target y, the other weights, and
ultimately the loss L.
• When we start to train neural networks, we initialize each
weight to have a value somewhere along the x-axis.
• Then, using the gradients we calculate during
backpropagation, we iteratively update the weight, with
our first update based on the slope of this curve at the
initial value we happened to choose.
• The goal of training a deep learning model is to move
each weight to the “global” value for which the loss is
minimized.
• If the steps we take are too small, we risk ending up in a
“local” minimum, which is less optimal than the global
one.
• If the steps are too large, we risk “repeatedly hopping
over” the global minimum, even if we are near it.
• The fundamental trade-off of tuning learning rates: if they are too
small, we can get stuck in a local minimum; if they are too large, they
can skip over the global minimum.
• There are thousands, if not millions, of weights in a neural network,
so searching for a global minimum in a space that has
thousands or millions of dimensions is far more complicated than this one-weight picture suggests.
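The effect of step size can be seen on a one-weight toy loss. The sketch below assumes L(w) = w², a starting value of 3.0, and 20 steps; this convex loss has a single minimum, so it only illustrates slow progress and overshooting, not getting stuck in a local minimum.

```python
def gradient_descent(lr, w0=3.0, steps=20):
    """Minimize the toy loss L(w) = w**2 (gradient 2*w) starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

w_small = gradient_descent(lr=0.001)  # barely moves away from 3.0 in 20 steps
w_good = gradient_descent(lr=0.1)     # approaches the minimum at 0
w_large = gradient_descent(lr=1.0)    # hops between +3 and -3 without converging
```

With `lr=1.0` every update flips the sign of `w`, which is the "repeatedly hopping over" behaviour described above.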
SoftMax Cross Entropy Loss Function
| Problem | Loss | No. of nodes in output layer | Activation in output layer | Metrics | Actual Value (y) |
| Regression | MSE, MAE | 1 | Linear | RMSE, R² score | - |
| Classification (Multi class) | Categorical cross entropy | No. of categories in target variable | Softmax | F1 score, Accuracy, Recall, Precision | One-hot Encoded |
| Classification (Two class) | Binary cross entropy | 1 | Sigmoid | F1 score, Accuracy, Recall, Precision | Label Encoded |
SoftMax Cross Entropy Loss Function
• Mean Squared Error (MSE) works well for Regression problems.
• It turns out that in classification problems, we can do better than this,
since in such problems we know that the values our network outputs
should be interpreted as probabilities; thus, not only should each
value be between 0 and 1, but the vector of probabilities should sum
to 1 for each observation we have fed through our network.
• The softmax cross entropy loss function exploits this to produce
steeper gradients than the mean squared error loss for the same
inputs.
The Softmax Function
• For a classification problem with N possible classes, we’ll have our
neural network output a vector of N values for each observation. For
a problem with three classes, these values could, for example, be:
[5, 3, 2]
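• Since we need probability values, one option is to normalize the vector by its sum:
$\text{normalize}(x)_i = \frac{x_i}{\sum_{j=1}^{N} x_j}$
• The softmax function for a vector is:
$\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$
• Intuition: softmax amplifies the largest value relative to the others, pushing the network toward a more decisive prediction.
• Output for [5, 3, 2]: normalization gives [0.5, 0.3, 0.2], while softmax gives approximately [0.84, 0.11, 0.04].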
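The comparison between plain normalization and softmax for the example vector [5, 3, 2] can be reproduced with a short sketch. The function names `normalize` and `softmax` are chosen here for illustration.

```python
import numpy as np

def normalize(x):
    # Divide each value by the sum so the result adds up to 1.
    return x / x.sum()

def softmax(x):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([5.0, 3.0, 2.0])
print(np.round(normalize(x), 2))
print(np.round(softmax(x), 2))
```

Both outputs sum to 1, but softmax assigns a much larger share to the largest input.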
The Cross Entropy Loss
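• It computes the loss between the actual values y and the predicted probabilities p.
• The cross entropy loss function, for each index i in these vectors, is:
$CE(p_i, y_i) = -y_i \log(p_i) - (1 - y_i) \log(1 - p_i)$
• Intuition
To see why this makes sense as a loss function, consider that since
every element of y is either 0 or 1, the preceding equation reduces to:
$CE(p_i, y_i) = -\log(1 - p_i)$ if $y_i = 0$, and $CE(p_i, y_i) = -\log(p_i)$ if $y_i = 1$
• SoftMax Cross Entropy (SCE): the cross entropy loss applied to the softmax of the network's outputs x:
$SCE(x, y)_i = -y_i \log(\text{softmax}(x)_i) - (1 - y_i) \log(1 - \text{softmax}(x)_i)$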
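A minimal sketch of the softmax cross entropy loss follows. It assumes the element-wise form with both the y and (1 − y) terms, summed over all elements; the clipping constant `eps` and the function names are illustrative choices, not part of the original notes.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax; subtracting the row max avoids overflow.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def softmax_cross_entropy(logits, y, eps=1e-9):
    """Sum of element-wise cross entropy between softmax(logits) and one-hot y."""
    p = np.clip(softmax(logits), eps, 1.0 - eps)  # keep log() finite
    return np.sum(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))

y = np.array([[1.0, 0.0, 0.0]])                  # true class is index 0
good = softmax_cross_entropy(np.array([[5.0, 3.0, 2.0]]), y)
bad = softmax_cross_entropy(np.array([[2.0, 3.0, 5.0]]), y)
```

The loss is small when the largest output matches the true class (`good`) and much larger when it points at the wrong class (`bad`).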
Note on Activation Functions
• The sigmoid function was a reasonable first activation function because it:
• Is a nonlinear and monotonic function
• Provides a “regularizing” effect on the model, forcing the
intermediate features down to a finite range, specifically between 0
and 1
• Its downside: the gradient that gets passed to the sigmoid function (or any
function) on the backward pass represents how much the function’s
output ultimately affects the loss; because the maximum slope of the
sigmoid function is 0.25, these gradients will at best be divided by 4
when sent backward to the previous operation in the model.
• Worse still, when the input to the sigmoid function is less than –2 or
greater than 2, the gradient those inputs receive will be small (about 0.1 or less),
since sigmoid(x) is nearly flat beyond x = –2 and x = 2.
• Alternatives:
• ReLU: output 0 to x; gradient 0.5 on average (0 for negative inputs, 1 for positive inputs)
• Tanh: output -1 to 1; maximum gradient 1
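The slope figures quoted for sigmoid can be checked numerically, with tanh shown for comparison. The sample points in `xs` are arbitrary choices for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s), largest at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)**2, largest at x = 0.
    return 1.0 - np.tanh(x) ** 2

xs = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
print(np.round(sigmoid_grad(xs), 3))
print(np.round(tanh_grad(xs), 3))
```

The sigmoid slope peaks at 0.25 at x = 0, while the tanh slope peaks at 1, so tanh shrinks the gradient less near zero.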
Experiments
• We’ll use the MNIST dataset, which consists of black and white
images of handwritten digits that are 28 × 28 pixels, with the value of
each pixel ranging from 0 (white) to 255 (black)
• dataset is predivided into a training set of 60,000 images and a
testing set of 10,000 additional images

Data Preprocessing

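• Apply one-hot encoding to transform the vectors representing the labels
into an ndarray of the same shape as the predictions (for example, the label 3 becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]).
• Scale the data to mean 0 and variance 1.
• Since each observation is an image, scale globally (subtract the overall mean and divide by the overall standard deviation) rather than scaling each pixel separately.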
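Both preprocessing steps can be sketched in a few lines of NumPy. To stay self-contained, the example uses random stand-in arrays with MNIST-like shapes rather than loading the real dataset, and it assumes global scaling (a single mean and standard deviation for all pixels).

```python
import numpy as np

# Stand-in data with MNIST-like shapes (random values, not the real dataset).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(1000, 784)).astype(np.float64)
y_train = rng.integers(0, 10, size=1000)

# One-hot encode the labels: shape (n, 10), matching the predictions.
num_labels = 10
y_onehot = np.zeros((y_train.shape[0], num_labels))
y_onehot[np.arange(y_train.shape[0]), y_train] = 1.0

# Global scaling: one mean and one standard deviation for all pixels.
mean, std = X_train.mean(), X_train.std()
X_scaled = (X_train - mean) / std
```

When a separate test set is used, the same `mean` and `std` computed on the training set would normally be applied to it.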
Model
• Output: 10 labels
• Output layer activation function is Sigmoid (keeps each output between 0 and 1, so it can be read as a probability)
• Hidden layer size: 89 neurons, the geometric mean of the number of inputs (784) and the number of outputs (10):
sqrt(784 × 10) ≈ 89
• 2-layer neural network
Momentum
• So far, the “update rule” for our weights at each time step has been to take the derivative of the loss
with respect to the weights and move the weights in the direction that reduces
the loss:
update = self.lr*kwargs['grad']
kwargs['param'] -= update
• Intuition for Momentum
• Suppose a parameter’s value is continually updated in the same direction because the
loss continues to decrease with each iteration.
• The value of the update at each time step is then analogous to the
parameter’s “velocity”.
• Physical objects don’t instantaneously stop and change direction, because
they have momentum; likewise, the update should depend on past updates and not only on the current gradient.
Implementation
• the parameter update at each time step will be a weighted average of
the parameter updates at past time steps
Math
• If the momentum parameter is $\mu$ and the gradient at each time step is $\nabla_t$,
our weight update is: $\text{update} = \nabla_t + \mu \nabla_{t-1} + \mu^2 \nabla_{t-2} + \dots$
Code
• Our Optimizer will keep track of a separate quantity representing the
history of parameter updates in addition to just receiving a gradient
at each time step.
• How should we update velocity? It turns out we can use the
following steps:
1. Multiply it by the momentum parameter.
2. Add the gradient.
• This results in the velocity taking on the following values at each time
step, starting at t = 1:
1. $\nabla_1$
2. $\nabla_2 + \mu \nabla_1$
3. $\nabla_3 + \mu (\nabla_2 + \mu \nabla_1) = \nabla_3 + \mu \nabla_2 + \mu^2 \nabla_1$
SGD Momentum
• optimizer = SGDMomentum(lr=0.1, momentum=0.9)
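The two velocity steps can be wrapped in a small optimizer class. This is a sketch under assumptions: the `step(param, grad)` interface and the choice to scale the velocity by the learning rate at update time are illustrative, and the course's own SGDMomentum class may be organized differently.

```python
import numpy as np

class SGDMomentum:
    """Sketch of SGD with momentum; the interface is an assumption."""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None

    def step(self, param, grad):
        if self.velocity is None:
            self.velocity = np.zeros_like(param)
        # 1. Multiply the velocity by the momentum parameter.
        self.velocity *= self.momentum
        # 2. Add the gradient.
        self.velocity += grad
        # Move the parameter against the accumulated velocity (in place).
        param -= self.lr * self.velocity

optimizer = SGDMomentum(lr=0.1, momentum=0.9)
w = np.array([1.0])
velocities = []
for _ in range(3):
    optimizer.step(w, grad=np.array([1.0]))   # constant gradient of 1
    velocities.append(float(optimizer.velocity[0]))
```

With a constant gradient of 1 and μ = 0.9, the velocity grows as 1, 1.9, 2.71, matching the pattern ∇3 + μ∇2 + μ²∇1.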
Learning Rate Decay
• Optimizing learning rate
• Learning Rate Decay is a technique used to gradually decrease the
learning rate during the training process in deep learning. The idea
behind it is to start with a higher learning rate to make fast progress
early in training, and then reduce the learning rate as training
progresses to fine-tune the model and prevent overshooting the
optimal solution.
• Advantages:
• Fast initial learning
• Avoid overshooting
• Better convergence
Types
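• Linear decay:
$\alpha_t = \alpha_{start} - (\alpha_{start} - \alpha_{end}) \times \frac{t}{N}$
where N is the total number of epochs, $\alpha_{start}$ and $\alpha_{end}$ are the starting and ending values of the learning rate, and t is the time step (epoch).
• Exponential decay:
$\alpha_t = \alpha_{start} \times \delta^t$
where $0 < \delta < 1$ is the decay factor per epoch; for example, $\delta = (\alpha_{end} / \alpha_{start})^{1/N}$ gives $\alpha_N = \alpha_{end}$.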
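The two schedules can be written as small functions. This sketch assumes both schedules run from a starting rate to an ending rate over `n_epochs` epochs; the function names and the example rates (0.1 down to 0.01) are illustrative.

```python
def linear_decay(lr_start, lr_end, t, n_epochs):
    """Learning rate at epoch t, falling by an equal amount each epoch."""
    return lr_start - (lr_start - lr_end) * t / n_epochs

def exponential_decay(lr_start, lr_end, t, n_epochs):
    """Learning rate at epoch t, shrinking by a constant factor each epoch."""
    delta = (lr_end / lr_start) ** (1.0 / n_epochs)
    return lr_start * delta ** t

for t in (0, 5, 10):
    print(t, linear_decay(0.1, 0.01, t, 10), exponential_decay(0.1, 0.01, t, 10))
```

Both schedules start and end at the same values, but the exponential one drops faster early on and flattens out later.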
Weight Initialization
• Weight Initialization is a crucial step in deep learning to ensure that
neural networks train effectively. It involves assigning initial values to
the weights of the network before the training process begins. The
choice of weight initialization can significantly impact the model's
ability to converge, avoid vanishing or exploding gradients, and reach
a good solution faster.
Why is Weight Initialization Important?

Improper initialization of weights can lead to:


• Vanishing or exploding gradients: When gradients become too small
or too large, making training ineffective.
• Slow convergence: Poor initialization can make gradient-based
optimizers converge very slowly.
• Failure to break symmetry: If all weights are initialized to the same value
(like zero), every neuron in a layer computes the same output and receives the same update, so the network lacks the diversity needed to learn useful
features during training.
Common Weight Initialization Techniques
1. Zero Initialization:
• Method: All weights are initialized to zero.
• Problem: This leads to the same gradient for every neuron in a layer,
resulting in identical updates for all neurons, making the network unable to
learn properly.
2. Random Initialization:
• Method: Weights are initialized randomly from a uniform or normal
distribution.
• Problem: If the variance of the random values is too high, it can lead to
exploding gradients; if it's too low, it can lead to vanishing gradients.
3. Xavier (Glorot) Initialization:
• Method: Weights are initialized by sampling from a distribution with
zero mean and a variance that depends on the number of input and
output neurons.
• The goal is to maintain the variance of activations and gradients
throughout the network.
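• Formula for Xavier initialization (normal form):
$W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$, where $n_{in}$ and $n_{out}$ are the numbers of input and output neurons of the layer.
• Xavier initialization is widely used with sigmoid or hyperbolic tangent
(tanh) activation functions to keep the variance of activations stable
across layers.
4. He Initialization:
• Method: Similar to Xavier initialization but designed for rectified
linear units (ReLU) or variants like Leaky ReLU.
• Formula for He initialization (normal form):
$W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$
• In code, the weights are initialized where they are created, in the layer's _setup_layer function.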
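Both schemes reduce to choosing the standard deviation of the random draw. The sketch below assumes the normal-distribution variants and uses a 784 → 89 layer as an example size; the function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out):
    # Variance 2 / (n_in + n_out): balances forward and backward signal.
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def he_normal(n_in, n_out):
    # Variance 2 / n_in: compensates for ReLU zeroing about half the inputs.
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W_xavier = xavier_normal(784, 89)   # e.g. for a sigmoid or tanh layer
W_he = he_normal(784, 89)           # e.g. for a ReLU layer
```

For the same layer size, He initialization gives slightly larger weights than Xavier, because its variance ignores n_out.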
Dropout
• Dropout is a regularization technique used in deep learning to
prevent overfitting. It involves randomly "dropping out" (i.e., setting
to zero) a subset of neurons during each forward and backward pass
of training. This forces the network to become more robust and
prevents it from relying too much on any single neuron.
• Dropout randomly disables a fraction of neurons during training,
which ensures that the network learns to distribute the learning
across all neurons instead of becoming overly reliant on a small
subset. This improves the network's ability to generalize to unseen
data.
How Dropout Works:
• During Training:
• For each batch, a random subset of neurons is "dropped out" by setting
their activations to zero.
• The neurons that are dropped out do not contribute to the forward pass or
the gradients during the backward pass.
• The dropout rate (often denoted as p) specifies the fraction of neurons to
drop. For example, if p=0.5, 50% of the neurons are randomly dropped out
in each iteration.
• During Testing:
• No neurons are dropped out during inference (testing). Instead, the output
of each neuron is scaled by the keep probability 1−p (equivalently, the weights
are multiplied by 1−p) so that the expected activations match those seen
during training.
Mathematical Explanation:
• Let’s say $h_l$ is the output of layer l before applying dropout. Dropout
randomly masks out neurons according to the dropout rate p,
resulting in:
• $h_l' = M_l \odot h_l$
• Where:
• $M_l$ is a binary mask drawn from a Bernoulli distribution: each entry is 0 (dropped) with
probability p and 1 (kept) with probability 1−p,
• ⊙ denotes the element-wise product.
• During testing, the activations are scaled by 1−p to account for the
dropout applied during training: $h_l^{test} = (1 - p) h_l$
• Dropout Rate:
• The dropout rate p determines the fraction of neurons to drop out.
Typical values of p range from 0.2 to 0.5. For example:
• p=0.5: Commonly used for fully connected layers.
• p=0.2 or p=0.3: Often used for convolutional layers, as convolutional
layers are usually less prone to overfitting.
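The training and testing behaviour can be sketched directly in NumPy. This follows the scheme described here (mask during training, scale by 1−p at test time); the function names `dropout_forward` and `dropout_backward` are illustrative. Many libraries instead use "inverted" dropout, which rescales during training and leaves inference untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p, training):
    """Standard (non-inverted) dropout with drop rate p."""
    if training:
        # Mask entries are 1 with probability 1 - p (keep) and 0 otherwise.
        mask = rng.binomial(1, 1.0 - p, size=h.shape)
        return h * mask, mask
    # At test time nothing is dropped; scale by the keep probability instead.
    return h * (1.0 - p), None

def dropout_backward(output_grad, mask):
    # Dropped neurons receive no gradient.
    return output_grad * mask

h = np.ones((1000, 100))
train_out, mask = dropout_forward(h, p=0.5, training=True)
test_out, _ = dropout_forward(h, p=0.5, training=False)
```

On average, `train_out` and `test_out` have the same mean, which is the reason for the test-time scaling.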
When to Use Dropout:
• Dense layers: Dropout is commonly applied to fully connected layers,
especially in deep networks, to prevent overfitting.
• Convolutional layers: While less common, dropout can also be
applied to convolutional layers but typically with smaller rates (e.g.,
p=0.2).
• Recurrent layers (RNNs, LSTMs): Specialized versions of dropout (like
Variational Dropout) can be applied to recurrent networks.
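from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Dense

# Keras rescales the kept activations by 1/(1 - rate) during training,
# so no extra scaling is needed at test time.
model = Sequential()
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))  # 50% of neurons will be dropped out (training only)
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))  # Dropout applied again
model.add(Dense(10, activation='softmax'))  # 10-class output probabilities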
