Unit 2 DL
• Backward
• Loss
Trainer and Optimizer
• Training
Optimizer
Intuition
• Neural networks contain a bunch of weights; given these weights,
along with input data X and targets y, we can compute a resulting
“loss” (a minimal numerical sketch of this idea follows below).
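As a rough illustration, here is a minimal NumPy sketch (an assumed toy example, not from the slides) of one trainer/optimizer cycle for a tiny linear model: compute the loss from the weights and data on the forward pass, compute gradients on the backward pass, and let the optimizer update the weights.

import numpy as np

# Toy data: 100 observations with 3 features each, plus a target value.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

W = np.zeros(3)   # weights of a tiny linear "network"
b = 0.0
lr = 0.1          # learning rate used by the optimizer

for step in range(200):
    preds = X @ W + b                      # forward pass
    loss = np.mean((preds - y) ** 2)       # MSE loss computed from weights + data
    grad_preds = 2 * (preds - y) / len(y)  # backward pass: dLoss/dPreds
    W -= lr * (X.T @ grad_preds)           # optimizer step: gradient descent on W
    b -= lr * grad_preds.sum()             # ...and on b

print(loss, W)  # loss shrinks toward 0 and W approaches [2, -1, 0.5]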
Problem type                 | Loss function             | No. of output neurons                | Output activation | Target encoding | Metrics
Regression                   | MSE, MAE, RMSE            | 1                                    | Linear            | -               | R2 score
Classification (Multi class) | Categorical cross entropy | No. of categories in target variable | Softmax           | One-hot encoded | Accuracy, Recall, Precision, F1 score
Classification (Two class)   | Binary cross entropy      | 1                                    | Sigmoid           | Label encoded   | Accuracy, Recall, Precision, F1 score
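As a hedged illustration of the table above (layer sizes here are placeholders, not from the slides), the output layer and loss choices map to Keras roughly as follows:

from tensorflow.keras.layers import Dense

# Regression: 1 linear output neuron, trained with MSE (or MAE/RMSE)
regression_output = Dense(1, activation='linear')    # compile with loss='mse'

# Multi-class classification: one softmax neuron per category, one-hot targets
multiclass_output = Dense(10, activation='softmax')  # compile with loss='categorical_crossentropy'

# Two-class classification: 1 sigmoid neuron, label-encoded (0/1) targets
binary_output = Dense(1, activation='sigmoid')       # compile with loss='binary_crossentropy'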
SoftMax Cross Entropy Loss Function
• Mean Squared Error (MSE) works well for Regression problems.
• It turns out that in classification problems, we can do better than this,
since in such problems we know that the values our network outputs
should be interpreted as probabilities; thus, not only should each
value be between 0 and 1, but the vector of probabilities should sum
to 1 for each observation we have fed through our network.
• The softmax cross entropy loss function exploits this to produce
steeper gradients than the mean squared error loss for the same
inputs.
The Softmax Function
• For a classification problem with N possible classes, we’ll have our
neural network output a vector of N values for each observation. For
a problem with three classes, these values could, for example, be:
[5, 3, 2]
• Since we need probability values, we could simply normalize the vector by dividing each value by its sum; for [5, 3, 2] this gives [0.5, 0.3, 0.2].
• The softmax function for a vector x = (x_1, ..., x_N) is, for each component i:
  softmax(x)_i = e^{x_i} / (e^{x_1} + e^{x_2} + ... + e^{x_N})
• Intuition: the exponential amplifies the largest value, so softmax produces a much sharper distribution than plain normalization.
• Softmax calculator: a small code sketch for this example follows below.
• Output: softmax([5, 3, 2]) ≈ [0.84, 0.11, 0.04]
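A minimal NumPy sketch of the softmax calculator referenced above (not from the original slides), applied to the example vector [5, 3, 2]:

import numpy as np

def softmax(x):
    shifted = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return shifted / shifted.sum()   # divide by the sum so the result sums to 1

logits = np.array([5.0, 3.0, 2.0])
print(softmax(logits))        # approximately [0.844, 0.114, 0.042]
print(softmax(logits).sum())  # 1.0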
The Cross Entropy Loss
• It computes the loss between the actual and predicted probability distributions.
• The cross entropy loss function, for each index i in these vectors, is:
  CE(p_i, y_i) = -y_i * log(p_i) - (1 - y_i) * log(1 - p_i)
• Intuition: to see why this makes sense as a loss function, consider that since
every element of y is either 0 or 1, the preceding equation reduces to:
  CE(p_i, y_i) = -log(p_i)      if y_i = 1
  CE(p_i, y_i) = -log(1 - p_i)  if y_i = 0
so the loss is small when the predicted probability is close to the true label and grows sharply as the prediction moves away from it.
• Softmax Cross Entropy (SCE): first apply the softmax function to the network outputs so they form a valid probability distribution, then feed the resulting probabilities into the cross entropy loss.
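Here is a minimal NumPy sketch (assumed helper names, not from the slides) of the SCE computation using the per-index formula above, for the earlier example with a one-hot target:

import numpy as np

def softmax_cross_entropy(logits, y, eps=1e-9):
    # Softmax: convert raw outputs into probabilities that sum to 1.
    exps = np.exp(logits - np.max(logits))
    p = exps / exps.sum()
    # Cross entropy, applied to each index i and summed:
    # -y_i * log(p_i) - (1 - y_i) * log(1 - p_i)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

logits = np.array([5.0, 3.0, 2.0])
y = np.array([1.0, 0.0, 0.0])  # one-hot target: the true class is the first one
print(softmax_cross_entropy(logits, y))  # ~0.33; small because p_1 is already ~0.84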
Note on Activation Functions
• The sigmoid is a nonlinear and monotonic function.
• It provides a “regularizing” effect on the model, forcing the
intermediate features down to a finite range, specifically between 0
and 1.
• The gradient that gets passed to the sigmoid function (or any
function) on the backward pass represents how much the function’s
output ultimately affects the loss; because the maximum slope of the
sigmoid function is 0.25, these gradients will at best be divided by 4
when sent backward to the previous operation in the model.
• Worse still, when the input to the sigmoid function is less than –2 or
greater than 2, the gradient those inputs receive will be almost 0,
since sigmoid(x) is almost flat at x = –2 or x = 2.
• Output: 0 to x; Gradient: 0.5
• Output: -1 to 1; Gradient: 1
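To make the gradient magnitudes above concrete, here is a small NumPy sketch (not from the slides) that evaluates the sigmoid derivative, whose slope peaks at 0.25 and is nearly flat beyond x = ±2, alongside the derivative of tanh (the -1 to 1 activation), whose slope peaks at 1:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # maximum value 0.25, at x = 0

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # maximum value 1, at x = 0

xs = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
print(sigmoid_grad(xs))  # ~[0.018 0.105 0.25  0.105 0.018]
print(tanh_grad(xs))     # ~[0.001 0.071 1.0   0.071 0.001]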
Experiments
• We’ll use the MNIST dataset, which consists of black and white
images of handwritten digits that are 28 × 28 pixels, with the value of
each pixel ranging from 0 (white) to 255 (black)
• The dataset is pre-divided into a training set of 60,000 images and a
test set of 10,000 additional images.
Data Preprocessing
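The preprocessing code itself is not shown here; the following is a hedged sketch assuming the usual Keras MNIST workflow: load the data, flatten each 28 × 28 image into a 784-element vector, scale pixel values to [0, 1], and one-hot encode the 10 digit labels.

from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Flatten the 28 x 28 images into 784-element vectors and scale pixels to [0, 1].
X_train = X_train.reshape(60000, 784).astype('float32') / 255.0
X_test = X_test.reshape(10000, 784).astype('float32') / 255.0

# One-hot encode the digit labels (10 classes), as required by categorical cross entropy.
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)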
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(784,)))  # 784 = 28 x 28 flattened pixels
model.add(Dropout(0.5))  # 50% of neurons will be dropped out during training
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))  # Dropout applied again
model.add(Dense(10, activation='softmax'))  # one output per digit class
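A hedged sketch of how this model might be compiled and trained on the preprocessed data above; the categorical cross entropy loss pairs with the softmax output layer, as described in the loss-function table.

model.compile(loss='categorical_crossentropy',  # multi-class loss for one-hot targets
              optimizer='sgd',                  # assumed optimizer choice
              metrics=['accuracy'])
model.fit(X_train, y_train,
          epochs=10, batch_size=128,            # assumed training settings
          validation_data=(X_test, y_test))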