
Deep Learning (Part 2): Loss Function and Gradient Function

Sumbatilinda · 17 min read · Apr 9, 2024

Gradient descent and loss functions were among the first ideas I learned when I
began studying machine learning, and I understood them better once I put more
work into understanding deep learning. They seem overly abstract at first (maybe as a
result of the resources I was using).


I’ve worked really hard to learn more about it, and today I’ve made the decision to
turn the notes I took on the subject into a brief blog post to help newcomers who,
like me, find the subject confusing.

So first, let's start by understanding what a loss function is.

Loss Function
In machine learning, loss functions help a model determine how wrong it is and
improve itself based on that wrongness. They are mathematical functions that
quantify the difference between predicted and actual values in a machine learning
model, but this isn't all they do.

The loss function is also referred to as the cost function or error function.

A loss function quantifies how well a model performs during the training phase.

The loss function estimates how well a particular algorithm models the provided
data. Loss functions fall into two classes based on the type of learning task:

Regression models: predict continuous values.

Classification models: predict the output from a finite set of categorical values.

The loss function is a function, and this means it has inputs and outputs.

The inputs are the predictions from the model as well as the ground truths, and the
output is some number known as the loss. The loss quantifies how good or bad the
model's prediction was.

Therefore, depending on the type of problem, the loss function could be anything.

Let us give an example: imagine we are dealing with a regression problem; then our
loss function could be mean squared error.


If we are dealing with a classification problem, then this could be a cross-entropy loss.

Therefore, the question we need to ask here is: "how is the loss function used in
training?"

Loss function in training

Shown below is a simple neural network. The goal is for this network to
learn how to take in an image and classify it as a dog or not a dog. This training, of
course, is done on thousands of image-label pairs.

After that we go to the inference phase, where we pass an unseen image to the
network, and the network is ideally able to identify whether it is a dog or not a dog.


So the first step is passing the image through the network. The network then makes a
prediction of whether this image is a dog or not a dog, giving a probability
between 0 and 1. This is then compared to the ground truth to generate a loss; in our
case the ground truth is 1, since the image is a dog. And since this is a classification
problem, a cross-entropy loss can be used here to tell how far the prediction is from
the "dog" label.

However, if the predicted probability is far from 1, then the model has made an
incorrect prediction and the loss is high. In binary classification tasks like this,
cross-entropy loss is commonly used to quantify the difference between the
predicted probability distribution and the true distribution.
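
To make this concrete, here is a small worked example of my own (not from the article). For a dog image the ground truth label is 1, so the binary cross-entropy loss reduces to the negative log of the predicted probability:

prediction = 0.9  ->  loss = -log(0.9) ≈ 0.11  (close to the label, small loss)
prediction = 0.2  ->  loss = -log(0.2) ≈ 1.61  (far from the label, large loss)

So the closer the predicted probability is to the true label, the smaller the loss.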

Now let us go back and discuss the types of loss functions in more detail:

REGRESSION LOSSES

Mean Squared Error (MSE)

It is the mean of the squared residuals over all the datapoints in the dataset.
A residual is the difference between the actual value and the value predicted by
the model.

Residuals are squared to turn negative values into positive ones, since the raw
error can be either positive or negative.

If positive and negative errors were simply added together, the result could be 0.
This would tell the model that its net error is zero and that it is performing well,
even though it may actually be performing poorly.

As a result, squaring keeps every term positive and reflects the model's real
performance.

Larger errors are likewise given more weight when squaring. Squaring the error
penalizes the model more and helps it approach the minimum faster
when the cost function is far from its minimum value.
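
Written out (in my own notation, matching the code below), for n datapoints with true values y_i and predictions ŷ_i:

MSE = (1/n) * Σ (y_i - ŷ_i)²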

import numpy as np

# Mean Squared Error
def mse(y, y_pred):
    return np.sum((y - y_pred) ** 2) / np.size(y)

# Example usage
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

mse_value = mse(y_true, y_pred)
print("Mean Squared Error:", mse_value)

This is my output: Mean Squared Error: 0.375

When you calculate the MSE for the provided true and predicted values, it comes
out to be 0.375 . This means, on average, the squared difference between the true
values and the predicted values is 0.375 . A smaller MSE indicates better agreement
between the true and predicted values.

Mean Absolute Error (MAE)


It is the mean of the absolute residuals over all the datapoints in the dataset.
A residual is the difference between the actual value and the value predicted by
the model.

Taking the absolute value of the residuals converts negative values to positive ones.

The mean is taken to make the loss function independent of the number of datapoints in
the training set.

MAE is generally less preferred than MSE because it is harder to work with the
derivative of the absolute function: the absolute function is not
differentiable at its minimum.
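
In the same notation as before:

MAE = (1/n) * Σ |y_i - ŷ_i|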

Example code:

import numpy as np

# Mean Absolute Error
def mae(y, y_pred):
    return np.mean(np.abs(y - y_pred))

# Example usage
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

mae_value = mae(y_true, y_pred)
print("Mean Absolute Error:", mae_value)

This is my output: Mean Absolute Error: 0.5

When you calculate the MAE for the provided true and predicted values, it comes
out to be 0.5 . This means, on average, the absolute difference between the true
values and the predicted values is 0.5 . A smaller MAE indicates better agreement
between the true and predicted values.

CLASSIFICATION LOSSES
Let us first define what cross entropy is.


Cross Entropy

Cross-entropy, also known as logarithmic loss or log loss, is a popular loss function
used in machine learning to measure the performance of a classification model.

It measures the average number of bits required to identify an event drawn from one
probability distribution, p, when using the optimal code for another probability
distribution, q. In other words, cross-entropy measures the difference between the
true probability distribution and the probability distribution predicted by a
classification model.

Cross-Entropy Loss

Also known as negative log likelihood, it is the most commonly used loss function for
classification. Cross-entropy loss grows as the predicted probability diverges
from the actual label.

The cross-entropy loss function is used to find the optimal solution by adjusting the
weights of a machine learning model during training. The objective is to minimize
the error between the actual and predicted outcomes. Thus, a loss close to 0 is
a sign of a good model, whereas larger values are a sign of a poor-performing
model.
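
For a binary problem with n examples, true labels y_i in {0, 1} and predicted probabilities ŷ_i, the loss the code below computes can be written (again in my notation) as:

Loss = -(1/n) * Σ [ y_i * log(ŷ_i) + (1 - y_i) * log(1 - ŷ_i) ]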

Let's see it in code:

import numpy as np

def log_loss(y_true, y_predicted, epsilon=1e-15):
    # Clip predictions away from exactly 0 and 1 to avoid log(0)
    y_predicted_new = np.maximum(np.minimum(y_predicted, 1 - epsilon), epsilon)
    return -np.mean(y_true * np.log(y_predicted_new) + (1 - y_true) * np.log(1 - y_predicted_new))

# Example usage
y_true = np.array([1, 1, 0, 0, 1])
y_predicted = np.array([1, 1, 0, 0, 1])


loss_value = log_loss(y_true, y_predicted)


print("Log Loss:", loss_value)

This is my output: Log Loss: 9.992007221626415e-16

This function calculates the log loss (cross-entropy loss) between the true values
y_true and the predicted values y_predicted , with an optional small value epsilon

to prevent taking the logarithm of very small values or 1 minus very small values,
which can lead to numerical instability.

You can adjust the value of epsilon as needed. The function returns the calculated
log loss.
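
To see why that clipping matters, here is a quick check of my own using the log_loss function defined above. If a prediction were exactly 0 or 1 and the label disagreed, np.log would return negative infinity; with the epsilon clipping the loss stays large but finite.

y_true = np.array([1, 0])
y_predicted = np.array([0.0, 1.0])   # confidently wrong predictions
# Without clipping, np.log(0) would give -inf and the loss would blow up.
# With epsilon=1e-15 the predictions are clipped, so the loss is large but finite:
print(log_loss(y_true, y_predicted))  # roughly 34.5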

I hope that was a good explanation of loss functions.

Now let us look at gradient descent.

GRADIENT DESCENT
I know you may be asking yourself what gradient descent is.

But let us first define what a gradient is:

Gradient

This is the heart of machine learning.

A gradient is nothing but a derivative: it describes how the output of a function
changes when the inputs are varied a little.

I just want to tell you I had challenges when I started learning about gradient descent, so
I understand why you may be feeling the same way. Don't worry, I have made it easier for
you to understand.

The concept of gradient descent is very important in deep learning and neural
networks.

Machine learning largely relies on optimization algorithms since they help to alter
the model’s parameters to improve its performance on training data.


Using these methods, the optimal set of parameters to minimize a cost function can
be identified. The optimization approach adopted can have a significant impact on
the rate of convergence, the amount of noise in the updates, and the efficacy of the
model’s generalization.

It is essential to use the right optimization method for a certain case in order to
guarantee that the model is optimized successfully and reaches optimal
performance.

It could be beneficial to go over a few linear regression topics before we get started
with gradient descent. As you may remember, the equation of a line is
y = mx + b, where m stands for the slope and b for the y-intercept.

You might also remember using the mean squared error formula to calculate the
error between the actual output and the predicted output (y-hat) while plotting a
scatterplot in statistics to determine the line of best fit. Similar behavior is seen in
the gradient descent algorithm, which operates on a convex function.

We simply pick the starting point arbitrarily to assess the performance. From that
initial point, we will calculate the derivative, or slope. Using a tangent line, we can
then determine how steep the slope is. The updates to the parameters, the weights
and bias, will be based on the slope. The slope will be steeper at the beginning,
but it should progressively get less steep as new parameters are generated, until it
reaches the point of convergence, which is the lowest point on the curve.

The objective of gradient descent is to minimize the cost function, or the error
between the predicted and actual values of y. This is similar to the process of
determining the linear regression line of best fit. It needs two ingredients to
accomplish this: a direction and a learning rate. These determine the partial derivative
computations of future iterations, enabling the process to
progressively approach the local or global minimum (also known as the point of
convergence).

Gradient descent is an algorithm that minimizes a function by optimizing its parameters.

In machine learning, the Gradient Descent (GD) optimization technique is widely
employed to help determine the optimal set of parameters for a given model.


GD uses the gradient of the cost function in relation to the model parameters to
iteratively update the model parameters.

It works by iteratively adjusting the weights or parameters of the model in the
direction of the negative gradient of the cost function until the minimum of the cost
function is reached.

The strategy looks for the lowest point of the cost function by moving opposite to
the gradient, since the gradient points in the direction of the cost function's
steepest rise.

Because the gradient of the cost function must be calculated for every algorithm
iteration across the entire training dataset, GD can also be computationally costly.
Since GD can converge to the global minimum of the cost function under certain
conditions, it is often used as a benchmark for other optimization techniques.

So, to explain this a bit more: with gradient descent we start with a random
guess, and then we slowly move towards the best answer.

The update rule can be written as:

New Value = Old Value - Step Size

The step size here determines how far you move. In machine learning terms, the
step size is given by the learning rate multiplied by the slope.
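
In symbols (my notation, not the article's), for a parameter w, learning rate α and cost function L, one update step is:

w_new = w_old - α * (dL/dw)

and this is exactly the rule the weight and bias updates follow in the code later in this post.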


Mathematically, this process works by calculating the gradient (or slope) of the
function with respect to each parameter and then moving in the direction opposite
to the gradient to reach the minimum of the function.

Do not mind my drawing, ha ha ha

So the goal is always to reach the minimum.

Steps to implement Gradient Descent

1. Randomly initialize values.

2. Update values.


3. Repeat until the slope = 0.

A derivative is a term that comes from calculus and is calculated as the slope of the
graph at a particular point. The slope is described by drawing a tangent line to the
graph at that point. So, if we are able to compute this tangent line, we can
compute the desired direction to reach the minimum.

The learning rate (also referred to as the step size or alpha) is the size of the steps that
are taken to reach the minimum. This is typically a small value, and it is evaluated
and updated based on the behavior of the cost function. A high learning rate results in
larger steps but risks overshooting the minimum.

The learning rate must be chosen wisely:

1. if it is too small, the model will take a long time to learn.
2. if it is too large, the model will not converge, because the updates will overshoot
and we'll never be able to reach the minimum.

Adjusting the learning rate is crucial to balance convergence speed against
overshooting the optimal solution. A small toy example of these steps is sketched below.
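
To make the three steps and the role of the learning rate concrete, here is a minimal sketch of my own (a toy example, not from the article) that minimizes f(w) = (w - 3)², whose derivative is 2(w - 3):

# Toy example: minimize f(w) = (w - 3)**2, whose derivative is 2 * (w - 3)
def toy_gradient_descent(lr=0.1, epochs=100):
    w = 0.0                      # step 1: start from an arbitrary value
    for _ in range(epochs):
        slope = 2 * (w - 3)      # derivative of f at the current point
        w = w - lr * slope       # step 2: new value = old value - lr * slope
        if abs(slope) < 1e-6:    # step 3: stop once the slope is (almost) 0
            break
    return w

print(toy_gradient_descent(lr=0.1))   # approaches the true minimum at w = 3
print(toy_gradient_descent(lr=1.1))   # learning rate too large: the updates overshoot and diverge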

Now let us implement an example here.

First, I imported the required packages as follows:

import numpy as np
import tensorflow as tf
from tensorflow import keras

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

I went ahead and loaded my dataset:

df = pd.read_csv("insurance_data.csv")
df.head()

This is what it looks like:

Next was defining my X and y and splitting the data for training and testing:

X = df[['age', 'affordibility']]
y = df['bought_insurance']
from sklearn.model_selection import train_test_split
# Note: the random_state value was cut off in the original post; 25 below is just a placeholder
X_train, X_test, y_train, y_tests = train_test_split(X, y, test_size=0.2, random_state=25)

I went ahead, just to look at what my X_train looks like, as shown below:

X_train


Next, I scaled my X values so that they lie within the range 0 to 1 (the age column is divided by 100):

X_train_scaled = X_train.copy()  # Create a copy of the DataFrame
X_train_scaled['age'] = X_train_scaled['age'] / 100

X_test_scaled = X_test.copy()
X_test_scaled['age'] = X_test_scaled['age'] / 100

So this is what my scaled training data looks like. You can see it is now scaled.


X_train_scaled

This is my output

Next, I built a neural network:

model = keras.Sequential([
    keras.layers.Dense(1, input_shape=(2,), activation='sigmoid',
                       kernel_initializer='ones', bias_initializer='zeros')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

keras.Sequential : This creates a sequential model, where layers are stacked
sequentially. In this case, there's only one layer.

keras.layers.Dense : This is a fully connected layer. It takes several parameters:

1 : This indicates the number of neurons in the layer.

input_shape=(2,) : This specifies the shape of the input data. In this case, it
expects inputs of shape (2,), meaning each input sample has two features (in our
case age and affordibility).

activation='sigmoid' : This sets the activation function for the layer to the
sigmoid function, which squashes the output between 0 and 1.

kernel_initializer='ones' : This initializes the weights of the layer to ones. This
is not typical and is often done for demonstration purposes.

bias_initializer='zeros' : This initializes the biases of the layer to zeros.

model.compile : This compiles the model. It takes several parameters:

optimizer='adam' : This sets the optimization algorithm to Adam, a popular
optimization algorithm.

loss='binary_crossentropy' : This sets the loss function to binary cross-entropy,
which is commonly used for binary classification problems.

metrics=['accuracy'] : This specifies the evaluation metric during training to be
accuracy, which measures the fraction of correctly classified samples.

Next I fit the model. I settled on 5000 epochs after trial and error, and that is what gave
me a good accuracy for my model:


model.fit(X_train_scaled, y_train, epochs = 5000)

I took a screenshot of the last few epochs to show what the output looks like.

Next, I evaluated the model on the test dataset:

model.evaluate(X_test_scaled, y_tests)

This was my output:

[0.26470303535461426, 1.0] : This is the output of the model.evaluate() method. It
returns a list where the first element is the loss and the second element is the
accuracy on the test data. In this case, the loss is 0.2647 and the accuracy is 1.0,
meaning the model classified every test sample correctly.

So this model can be used to make predictions on whether a person of a certain age
will be able to purchase insurance or not:

model.predict(X_test_scaled)


This was my output:

So the most awaited part is here.

Let us now implement gradient descent ourselves, to recover the weights and bias used:

import numpy as np
from sklearn.metrics import log_loss

def sigmoid_numpy(x):
    return 1 / (1 + np.exp(-x))

def gradient_descent(age, affordibility, y_true, epochs, loss_threshold):
    # Initial values for weights and bias
    w1 = w2 = 1
    bias = 0
    rate = 0.5
    n = len(age)

    for i in range(epochs):
        # Calculate weighted sum and predicted values
        weighted_sum = w1 * age + w2 * affordibility + bias
        y_predicted = sigmoid_numpy(weighted_sum)

        # Calculate loss
        loss = log_loss(y_true, y_predicted)

        # Calculate derivatives
        w1d = (1/n) * np.dot(np.transpose(age), (y_predicted - y_true))
        w2d = (1/n) * np.dot(np.transpose(affordibility), (y_predicted - y_true))
        bias_d = np.mean(y_predicted - y_true)

        # Update weights and bias
        w1 = w1 - rate * w1d
        w2 = w2 - rate * w2d
        bias = bias - rate * bias_d

        print(f'Epoch:{i}, w1:{w1}, w2:{w2}, bias:{bias}, loss:{loss}')

        if loss <= loss_threshold:
            break

    return w1, w2, bias

Optimization Objective: The gradient_descent function aims to minimize the loss
function, which is computed using the log_loss function. By iteratively adjusting
the model parameters (weights w1 and w2 , and bias) based on the gradients of the
loss function, the function seeks to improve the model's predictive accuracy on the
training data ( age and affordibility ) relative to the true labels ( y_true ).

Gradient Descent Strategy: The function implements the gradient descent
optimization algorithm by computing the gradients of the loss function with respect
to the model parameters ( w1 , w2 , and bias ). These gradients indicate the
direction and magnitude of parameter updates that would reduce the loss. The
function then updates the parameters in the direction that minimizes the loss,
using a fixed learning rate ( rate ) across multiple epochs.

Training Process: During the training process, the function iterates over the training
dataset for a specified number of epochs ( 1000 in our example). In each epoch, it
computes predictions, evaluates the loss, computes gradients, and updates
parameters accordingly. By repeating this process, the function gradually adjusts
the model parameters to minimize the loss and improve predictive accuracy.

Control Parameters: The function accepts control parameters such as the learning
rate ( rate ), number of epochs, and loss threshold ( 0.4631 in our example). These
parameters allow users to customize the training process, controlling the speed of
convergence and determining when to terminate training based on the achieved
loss level.

Monitoring Progress: Throughout the training process, the function provides
feedback by printing information such as the epoch number, updated parameter
values ( w1 , w2 , bias) and current loss. This enables users to monitor the progress
of training and assess whether the model is converging effectively towards
minimizing the loss.

Next, we call the function



gradient_descent(X_train_scaled['age'], X_train_scaled['affordibility'], y_train, 1000, 0.4631)

By calling gradient_descent with these parameters, you are training the logistic
regression model using gradient descent optimization on the provided training data
( X_train_scaled['age'] , X_train_scaled['affordibility'] , y_train ) for 1000 epochs
or until the loss falls below 0.4631. The function will return the learned parameters
( w1 , w2 , bias ) of the logistic regression model.

This is my output:

So this means that

w1 = 6.729458709063028

w2 = 1.3482027165467205

bias = -3.583182157
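
As a quick sanity check of my own (not part of the original post), these learned parameters can be plugged back into the same weighted-sum-plus-sigmoid used inside gradient_descent to produce a probability for a new person:

def predict(age, affordibility, w1, w2, bias):
    # Same weighted sum + sigmoid as in gradient_descent
    return sigmoid_numpy(w1 * age + w2 * affordibility + bias)

# e.g. a 40-year-old (scaled age = 0.40) who can afford insurance
print(predict(0.40, 1, 6.729458709063028, 1.3482027165467205, -3.583182157))
# roughly 0.61, i.e. more likely than not to buy insurance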

Types of Gradient Descent

1. Stochastic Gradient Descent


SGD (Stochastic Gradient Descent) is a well-known machine learning optimization
technique.


In this variation of gradient descent, the model parameters are adjusted for each
iteration depending on the gradient of the cost function relative to a single training
sample.

Each iteration of this approach selects a single training sample at random. Because
SGD updates the model parameters much more frequently than batch Gradient
Descent, it tends to converge faster.

Yet, using a single training sample at random might lead to noisy updates and a very
variable cost function. SGD, despite its noise, is commonly preferred over Gradient
Descent because it converges more quickly and requires less memory to store the
cost function gradients.

2. Mini-batch Gradient Descent


Mini-batch Gradient Descent is a Gradient Descent version that falls in between
Stochastic Gradient Descent and Gradient Descent.

The model parameters are updated based on the average gradient of the cost
function with respect to the model parameters across each mini-batch, which are
smaller subsets of the training dataset of equal sizes.

Compared to batch Gradient Descent, Mini-batch Gradient Descent changes the
model parameters more often (though less often than SGD). The noise of stochastic
updates and the computing cost of full-batch updates are traded off, and mini-batch
gradient descent strikes a compromise between the two.

It is the deep learning optimization method that is most frequently employed and
provides a fair balance between speed and accuracy.

3. Batch gradient descent


Batch gradient descent sums the error for each point in a training set, updating the
model only after all training examples have been evaluated. This process is referred to
as a training epoch.

While this batching provides computation efficiency, it can still have a long
processing time for large training datasets as it still needs to store all of the data into
memory. Batch gradient descent also usually produces a stable error gradient and
convergence, but sometimes that convergence point isn’t the most ideal, finding the
local minimum versus the global one.
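
To summarize the difference in code, here is a minimal sketch of my own (the grad function and the names below are hypothetical, not from the article) showing how much data each variant uses per parameter update:

import numpy as np

# grad(w, X, y) stands for a hypothetical function returning the gradient of
# whatever loss you are using, for parameters w and data X, y (numpy arrays).

def sgd_epoch(w, X, y, lr, grad):
    # Stochastic: one randomly chosen sample per parameter update
    for i in np.random.permutation(len(X)):
        w = w - lr * grad(w, X[i:i+1], y[i:i+1])
    return w

def minibatch_epoch(w, X, y, lr, grad, batch_size=32):
    # Mini-batch: a small subset of the data per parameter update
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = w - lr * grad(w, X[batch], y[batch])
    return w

def batch_epoch(w, X, y, lr, grad):
    # Batch: a single parameter update per pass over the full dataset
    return w - lr * grad(w, X, y)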


Let's look at the differences. Here is an image I got from the internet; I hope it helps.

The most popular machine learning optimization methods are gradient descent,
stochastic gradient descent, and mini-batch gradient descent. Stochastic Gradient
Descent converges quickly but has high noise, whereas Gradient Descent converges
slowly but has low noise. With a reasonable level of noise, Mini-batch Gradient
Descent strikes a decent balance between speed and accuracy.

The size of the dataset, the amount of memory that is available, and the level of
precision necessary all play a role in selecting the best method. Understanding the
features of each algorithm will help you choose the best one for a given problem as a
data scientist or machine learning practitioner.

The article discusses the fundamental concepts of loss functions and gradient
descent in the context of training machine learning models. It highlights the
importance of loss functions in quantifying the disparity between model predictions
and true labels, which is essential for guiding the optimization process. Moreover, it
explains the role of gradient descent as an iterative optimization algorithm that
adjusts model parameters step by step to minimize the loss function.


Enjoy the read

Like and follow, Thank you
