Deep Learning (Part 2): Loss Function and Gradient Function
by Sumbatilinda
Gradient Descent and Loss Function were among the first ideas I learned when I
began studying machine learning. I understood them better when I started to put more
work into understanding deep learning. They seemed overly abstract at first (maybe as a
result of the resources I was using).
I’ve worked really hard to learn more about it, and today I’ve made the decision to
turn the notes I took on the subject into a brief blog post to help newcomers who,
like me, find the subject confusing.
So first we will start by getting to understand what a loss function is.
Loss Function
In machine learning, loss functions help a model determine how wrong it is and
improve itself based on that wrongness. They are mathematical functions that
quantify the difference between predicted and actual values in a machine learning
model, but this isn't all they do.
A loss function quantifies how well a model performs during the training phase.
The loss function estimates how well a particular algorithm models the provided
data. Loss functions are classified into two classes based on the type of learning task:
Regression Models: predict a continuous, numerical output value.
Classification Models: predict the output from a set of finite categorical values.
The loss function is a function, and this means it has inputs and outputs.
The inputs are the predictions from the model as well as the ground truths, and the
output is some number known as the loss. The loss quantifies how good or bad
the prediction of the model was.
Therefore, depending on the type of problem, the loss function could be anything.
Let us give an example: imagine we are dealing with a regression problem; then our
loss function could be the mean squared error loss.
If we are dealing with a classification problem, then this could be a cross-entropy loss.
Therefore, the question we need to ask here is, "how is the loss function used in
training?"
So below is a simple neural network. The goal is for this network to
learn to take in an image and classify it as a dog or not a dog. This training, of
course, is done on thousands of image-label pairs.
After that we go to the inference phase, where we pass an unseen image to the
network and the network is ideally able to identify whether it is a dog or not a dog.
So the first step is passing the image through the network. The network then makes a
prediction of whether the image is a dog or not a dog, giving a probability
between 0 and 1. This is then compared to the ground truth to generate a loss; in our
case the ground truth is 1, since it is a dog. And since this is a classification problem,
a cross-entropy loss can be used here to measure how far the predicted probability is from the ground truth.
However, if the predicted probability is far from 1, then the model has made an
incorrect prediction and the loss is high. In binary classification tasks like this,
cross-entropy loss is commonly used to quantify the difference between the
predicted probability distribution and the true distribution.
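To make this concrete, here is a minimal sketch (my own illustration, not taken from the post's code) of binary cross-entropy for a single prediction, where the ground truth for a dog image is 1:

import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    # clip the probability away from 0 and 1 to avoid log(0)
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(binary_cross_entropy(1, 0.9))  # confident and correct -> small loss (about 0.105)
print(binary_cross_entropy(1, 0.1))  # confident but wrong -> large loss (about 2.303)

The further the predicted probability is from the true label, the larger the loss, which is exactly the behaviour described above.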
REGRESSION LOSSES
Mean Squared Error (MSE)
It is the mean of the squared residuals over all the data points in the dataset.
A residual is the difference between the actual value and the value predicted by the model.
When positive and negative residuals are added together, the result could be 0.
This would wrongly inform the model that the net error is zero and that it is performing
perfectly, even when individual predictions are off.
As a result, the residuals are squared so that every term is positive and the metric
reflects the model's real performance.
Squaring likewise gives larger errors more weight: it penalizes the model more and
helps it approach the minimal value faster when the cost function is far from its
minimal value.
import numpy as np

# Example usage
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# mean of the squared differences between true and predicted values
mse = np.mean((y_true - y_pred) ** 2)
print(mse)
This is my output:
When you calculate the MSE for the provided true and predicted values, it comes
out to be 0.375 . This means, on average, the squared difference between the true
values and the predicted values is 0.375 . A smaller MSE indicates better agreement
between the true and predicted values.
Mean Absolute Error (MAE)
It is the mean of the absolute residuals over all the data points in the dataset.
A residual is the difference between the actual value and the value predicted by the model.
Example code:
import numpy as np

# Example usage
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# mean of the absolute differences between true and predicted values
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
This is my output:
When you calculate the MAE for the provided true and predicted values, it comes
out to be 0.5 . This means, on average, the absolute difference between the true
values and the predicted values is 0.5 . A smaller MAE indicates better agreement
between the true and predicted values.
CLASSIFICATION
Let us first define what cross-entropy is.
Cross Entropy
Cross-entropy, also known as logarithmic loss or log loss, is a popular loss function
used in machine learning to measure the performance of a classification model.
It measures the average number of bits required to identify an event drawn from the
true probability distribution, p, when using a code optimized for another probability
distribution, q. In other words, cross-entropy measures how far the probability
distribution discovered by a classification model is from the true distribution of the
labels.
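As a small illustration (my own example, with made-up numbers), the cross-entropy between a true one-hot distribution p and a model's predicted distribution q could be computed like this:

import numpy as np

p = np.array([1.0, 0.0, 0.0])   # true distribution (one-hot label for class 0)
q = np.array([0.7, 0.2, 0.1])   # predicted distribution from the model
cross_entropy = -np.sum(p * np.log(q))   # use np.log2 instead if you want the answer in bits
print(cross_entropy)   # about 0.357; only the true-class term contributes

The closer q is to p, the smaller this number gets.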
Cross-Entropy Loss
Also known as Negative Log Likelihood, it is a commonly used loss function for
classification. Cross-entropy loss increases as the predicted probability diverges
from the actual label.
The cross-entropy loss function is used to find the optimal solution by adjusting the
weights of a machine learning model during training. The objective is to minimize
the error between the actual and predicted outcomes. Thus, a loss close to 0 is
a sign of a good model, whereas a larger loss is a sign of a poor-performing
model.
import numpy as np
# Example usage
y_true = np.array([1, 1, 0, 0, 1])
y_predicted = np.array([1, 1, 0, 0, 1])
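Here is a minimal sketch of the log-loss function described below, assuming a default epsilon of 1e-15:

def log_loss(y_true, y_predicted, epsilon=1e-15):
    # keep predictions away from exactly 0 or 1 so the logarithm stays finite
    y_predicted = np.clip(y_predicted, epsilon, 1 - epsilon)
    # average binary cross-entropy over all samples
    return -np.mean(y_true * np.log(y_predicted) + (1 - y_true) * np.log(1 - y_predicted))

print(log_loss(y_true, y_predicted))  # the predictions match the labels exactly, so the loss is close to 0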
This function calculates the log loss (cross-entropy loss) between the true values
y_true and the predicted values y_predicted, with an optional small value epsilon
that keeps the predictions away from exactly 0 or 1, since taking the logarithm of
such values leads to numerical instability.
You can adjust the value of epsilon as needed. The function returns the calculated
log loss.
GRADIENT DESCENT
I know you may be asking yourself what gradient descent is.
Gradient
A gradient is nothing but a derivative: it describes how the output of a function
changes when the inputs are varied slightly.
I just want to tell you I had challenges when I started learning about gradient descent, so
I understand why you may also be feeling the same way. Don't worry, I have tried to make it
easier for you to understand.
The concept of gradient descent is very important in deep learning and neural
networks.
Machine learning largely relies on optimization algorithms since they help to alter
the model’s parameters to improve its performance on training data.
Using these methods, the optimal set of parameters to minimize a cost function can
be identified. The optimization approach adopted can have a significant impact on
the rate of convergence, the amount of noise in the updates, and the efficacy of the
model’s generalization.
It is essential to use the right optimization method for a certain case in order to
guarantee that the model is optimized successfully and reaches optimal
performance.
It could be beneficial to go over a few linear regression topics before we get started
with gradient descent. As you may remember, the equation of a line is
y = mx + b, where m stands for the slope and b for the y-intercept.
You might also remember using the mean squared error formula to calculate the
error between the actual output and the expected output (y-hat) while plotting a
scatterplot in statistics to determine the line of best fit. Similar behavior is seen in
the gradient descent algorithm, which is based on a convex function.
The starting point is chosen arbitrarily, and we assess the performance from there. From
that initial point we calculate the derivative, or slope, and by drawing a tangent line we
can determine how steep it is. The changes to the parameters, the weights and bias,
are based on this slope. The slope will be steeper at the beginning,
but it should progressively get less steep as new parameters are generated, until it
reaches the point of convergence, which is the lowest point on the curve.
The objective of gradient descent is to minimize the cost function, or the error
between the anticipated and actual values of y. This is similar to the process of
determining the linear regression line of best fit. It needs two things to
accomplish this: a direction and a learning rate. Future iterations' partial derivative
computations are determined by these elements, enabling the process to
progressively approach the local or global minimum (also known as the point of
convergence).
GD uses the gradient of the cost function with respect to the model parameters to
iteratively update those parameters.
The strategy looks for the lowest point of the cost function by moving in the
direction opposite to the gradient, since the gradient points in the direction of the cost
function's steepest rise.
Because the gradient of the cost function must be calculated for every algorithm
iteration across the entire training dataset, GD can also be computationally costly.
Since GD can converge to the global minimum of the cost function under certain
conditions, it is often used as a benchmark for other optimization techniques.
So, just a bit more explanation on this: with gradient descent, we start with a random
guess, and then we slowly move towards the best answer.
The step size here means how far you move at each update. In machine learning terms,
the step size is given by the learning rate multiplied by the slope. For example, with a
learning rate of 0.1 and a slope of 4, the step size is 0.1 × 4 = 0.4.
Mathematically, this process works by calculating the gradient (or slope) of the
function with respect to each parameter and then moving in the direction opposite
to the gradient to reach the minimum of the function.
1. Calculate the gradient of the cost function with respect to each parameter.
2. Update the values by moving each parameter a small step in the direction opposite to its gradient.
A derivative is a term that comes from calculus and is calculated as the slope of the
graph at a particular point. The slope is described by drawing a tangent line to the
graph at the point. So, if we are able to compute this tangent line, we might be able
to compute the desired direction to reach the minimum.
Learning rate (also referred to as step size or the alpha) is the size of the steps that
are taken to reach the minimum. This is typically a small value, and it is evaluated
and updated based on the behavior of the cost function. High learning rates result in
larger steps but risk overshooting the minimum.
Adjusting the learning rate is crucial to balance convergence speed and avoiding
overshooting the optimal solution.
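Before we get to the insurance example below, here is a tiny sketch (my own illustration) of these ideas on a one-variable function, f(w) = (w - 3)^2, whose minimum is at w = 3:

w = 10.0              # arbitrary starting guess
learning_rate = 0.1

for step in range(50):
    slope = 2 * (w - 3)             # derivative of (w - 3)^2
    w = w - learning_rate * slope   # step size = learning rate * slope, taken downhill

print(w)   # ends up very close to 3, the point of convergence

With a larger learning rate the steps get bigger and the risk of overshooting the minimum grows, which is exactly the trade-off described above.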
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
df = pd.read_csv("insurance_data.csv")
df.head()
Next was defining my X and y and splitting the data for training and testing.
X = df[['age', 'affordibility']]
y = df['bought_insurance']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_tests = train_test_split(X, y, test_size=0.2, random_state=25)  # random_state value assumed
I went ahead, just to look at how my X_train looks, as shown below:
X_train
X_test_scaled = X_test.copy()
X_test_scaled['age'] = X_test_scaled['age'] / 100
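The matching scaling of the training set is not shown above, so here is a minimal sketch, assuming the same approach of dividing the age column by 100:

X_train_scaled = X_train.copy()
X_train_scaled['age'] = X_train_scaled['age'] / 100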
So this is how my scaled training set looks. You can see the age column is now scaled.
X_train_scaled
This is my output
model = keras.Sequential([
    keras.layers.Dense(1, input_shape=(2,), activation='sigmoid',
                       kernel_initializer='ones')  # initializer value assumed
])
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
input_shape=(2,) : This specifies the shape of the input data. In this case, it
expects inputs of shape (2,), meaning each input sample has two features (in our
case age and affordibility).
activation='sigmoid' : This sets the activation function for the layer to the
sigmoid function, which squashes the output between 0 and 1.
Next I fit the model. I used 5000 epochs after trial and error, and that is what gave
me a good accuracy for my model.
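The fitting call is not shown above, so here is a minimal sketch of what it likely looked like, assuming the scaled training data and the 5000 epochs mentioned above:

model.fit(X_train_scaled, y_train, epochs=5000)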
I took a screenshot of just the last part of the epochs to show how the training output looks:
model.evaluate(X_test_scaled, y_tests)
So this model can be used to make predictions on whether a person of a certain age
is able to purchase insurance or not.
model.predict(X_test_scaled)
Let us now implement the gradient descent function ourselves to get the weights and bias that were used.
import numpy as np
from sklearn.metrics import log_loss
def sigmoid_numpy(x):
return 1/(1+np.exp(-x))
for i in range(epochs):
    # Calculate weighted sum and predicted values
    weighted_sum = w1 * age + w2 * affordibility + bias
    y_predicted = sigmoid_numpy(weighted_sum)

    # Calculate loss
    loss = log_loss(y_true, y_predicted)

    # Calculate derivatives
    w1d = (1/n) * np.dot(np.transpose(age), (y_predicted - y_true))
    w2d = (1/n) * np.dot(np.transpose(affordibility), (y_predicted - y_true))
    bias_d = np.mean(y_predicted - y_true)
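The loop above is only part of the picture; the parameter-update step and the surrounding function are described in the next paragraphs. Here is a minimal sketch of the complete gradient_descent function, assuming the parameter names loss_threshold and rate, and simple initial values of w1 = w2 = 1 and bias = 0:

def gradient_descent(age, affordibility, y_true, epochs, loss_threshold, rate):
    # initial values are assumptions chosen for illustration
    w1 = w2 = 1
    bias = 0
    n = len(age)

    for i in range(epochs):
        # forward pass: weighted sum followed by the sigmoid
        weighted_sum = w1 * age + w2 * affordibility + bias
        y_predicted = sigmoid_numpy(weighted_sum)

        # loss for this epoch
        loss = log_loss(y_true, y_predicted)

        # gradients of the loss with respect to w1, w2 and the bias
        w1d = (1/n) * np.dot(np.transpose(age), (y_predicted - y_true))
        w2d = (1/n) * np.dot(np.transpose(affordibility), (y_predicted - y_true))
        bias_d = np.mean(y_predicted - y_true)

        # update step: move each parameter against its gradient, scaled by the learning rate
        w1 = w1 - rate * w1d
        w2 = w2 - rate * w2d
        bias = bias - rate * bias_d

        # stop early once the loss drops below the chosen threshold
        if loss <= loss_threshold:
            break

    return w1, w2, bias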
The computed gradients indicate the direction and magnitude of the parameter updates that
would reduce the loss. The function then updates the parameters in the direction that
minimizes the loss, scaled by the learning rate.
Training Process: During the training process, the function iterates over the training
dataset for a specified number of epochs ( 1000 in our example). In each epoch, it
computes predictions, evaluates the loss, computes gradients, and updates
parameters accordingly. By repeating this process, the function gradually adjusts
the model parameters to minimize the loss and improve predictive accuracy.
Control Parameters: The function accepts control parameters such as the learning
rate ( rate ), number of epochs, and loss threshold ( 0.4631 in our example). These
parameters allow users to customize the training process, controlling the speed of
convergence and determining when to terminate training based on the achieved
loss level.
By calling gradient_descent with these parameters, you are training the logistic
regression model using gradient descent optimization on the provided training data
( X_train_scaled['age'] , X_train_scaled['affordibility'] , y_train ) for 1000 epochs
or until the loss falls below 0.4631. The function will return the learned parameters
( w1 , w2 , bias ) of the logistic regression model.
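Putting it together, and reusing the sketch of the function from above, the call might look like this (the learning rate value here is an assumption):

w1, w2, bias = gradient_descent(X_train_scaled['age'], X_train_scaled['affordibility'],
                                y_train, epochs=1000, loss_threshold=0.4631, rate=0.5)  # rate value assumed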
This is my output:
w1 = 6.729458709063028
w2 = 1.3482027165467205
bias = -3.583182157
Stochastic Gradient Descent (SGD)
In this variation of gradient descent, the model parameters are adjusted at each
iteration based on the gradient of the cost function with respect to a single training
sample.
Each iteration of this approach selects a single training sample at random. SGD
therefore modifies the model parameters far more often than batch Gradient Descent,
which can lead to faster convergence.
Yet, using a single training sample at random might lead to noisy updates and a very
variable cost function. SGD, despite its noise, is commonly preferred over Gradient
Descent because it converges more quickly and requires less memory to store the
cost function gradients.
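As a rough sketch (my own illustration, reusing the per-sample gradient form from the logistic-regression example above), one SGD epoch over numpy arrays X and y could look like this:

import numpy as np

def sgd_epoch(X, y, w, bias, rate=0.01):
    # visit the training samples one at a time, in random order
    for i in np.random.permutation(len(y)):
        prediction = 1 / (1 + np.exp(-(np.dot(w, X[i]) + bias)))
        error = prediction - y[i]
        # update the parameters immediately after seeing this single sample
        w = w - rate * error * X[i]
        bias = bias - rate * error
    return w, bias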
Mini-Batch Gradient Descent
The model parameters are updated based on the average gradient of the cost
function with respect to the model parameters across each mini-batch, where a
mini-batch is a smaller, equally sized subset of the training dataset.
It is the deep learning optimization method that is most frequently employed and
provides a fair balance between speed and accuracy.
While batching provides computational efficiency, full-batch gradient descent can still
have a long processing time for large training datasets, as it needs to store all of the
data in memory. Batch gradient descent also usually produces a stable error gradient and
stable convergence, but sometimes that convergence point isn't the most ideal, finding a
local minimum rather than the global one.
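Before comparing the three variants, here is a rough sketch of a mini-batch epoch (again my own illustration, with an assumed batch size of 32). With batch_size=1 this reduces to SGD, and with batch_size=len(y) it becomes full-batch gradient descent:

def minibatch_epoch(X, y, w, bias, rate=0.01, batch_size=32):
    indices = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        predictions = 1 / (1 + np.exp(-(Xb @ w + bias)))
        errors = predictions - yb
        # one update per mini-batch, using the average gradient over that batch
        w = w - rate * (Xb.T @ errors) / len(batch)
        bias = bias - rate * np.mean(errors)
    return w, bias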
Let's look at the differences. Here is an image I found on the internet; I hope it helps.
The most popular machine learning optimization methods are gradient descent,
stochastic gradient descent, and mini-batch gradient descent. Stochastic Gradient
Descent converges quickly but has high noise, whereas Gradient Descent converges
slowly but has low noise. With a reasonable level of noise, Mini-batch Gradient
Descent strikes a decent balance between speed and accuracy.
The size of the dataset, the amount of memory that is available, and the level of
precision necessary all play a role in selecting the best method. Understanding the
features of each algorithm will help you choose the best one for a given problem as a
data scientist or machine learning practitioner.
The article discusses the fundamental concepts of loss functions and gradient
descent in the context of training machine learning models. It highlights the
importance of loss functions in quantifying the disparity between model predictions
and true labels, essential for guiding the optimization process. Moreover, it explains
the role of gradient descent as an iterative optimization algorithm that adjusts model
parameters step by step to minimize the loss function.