Linear Regression: Gradient Descent Vs Analytical Solution
Egor Howell
Introduction
Let’s begin by explaining Gradient Descent. This will only be a brief description, as the topic has been covered thoroughly elsewhere, so please refer to other blogs or tutorials if you want a more in-depth explanation. In short, Gradient Descent is an iterative optimization algorithm that finds the minimum of a differentiable function by repeatedly taking small steps in the opposite direction of its gradient.
In Machine Learning, this differentiable function is the Loss Function, which tells us how well our current model fits the data. Gradient Descent is then used to update the model’s parameters so as to minimize the Loss Function.
Perhaps the easiest example to demonstrate Gradient Descent is for a Simple Linear
Regression Model. In this case, our hypothesis function, h(x), depends on a single feature
variable, x:
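h(x) = \theta_0 + \theta_1 x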
Where θ_0 and θ_1 are the parameters of the model. The Loss Function for this problem is
the Sum of Squares Error (SSE):
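For a dataset of n points (x_i, y_i), this is:

SSE(\theta_0, \theta_1) = \sum_{i=1}^{n} \left( y_i - \theta_0 - \theta_1 x_i \right)^2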
Therefore, we will use Gradient Descent to find the values of the parameters that minimize the above Loss Function.
As you can see, the Loss Function is differentiable and has a parabolic, bowl-like shape, hence it has a single minimum. As mentioned before, Gradient Descent updates the parameters of the model by taking small steps in the opposite direction of the gradient. Therefore, we need to compute the gradient of the Loss Function with respect to the two parameters:
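\frac{\partial SSE}{\partial \theta_0} = -2 \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)

\frac{\partial SSE}{\partial \theta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \theta_0 - \theta_1 x_i)

Each parameter is then updated by a small step against its own partial derivative:

\theta_0 \leftarrow \theta_0 - \eta \frac{\partial SSE}{\partial \theta_0}, \qquad \theta_1 \leftarrow \theta_1 - \eta \frac{\partial SSE}{\partial \theta_1}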
Where η is the learning rate, which determines the size of the step each parameter is updated by. The learning rate is typically a small positive value (often between zero and one) and controls how quickly we converge to the minimum. If it is too large, we may overshoot the minimum; if it is too small, convergence requires many more iterations and hence more computation time. Therefore, a happy medium needs to be found. This is where Hyperparameter Tuning comes in, through methods such as Grid Search, Random Search or even a Bayesian approach.
Given enough iterations (known in Data Science as training epochs), the gradient will tend to zero. At that point the parameter values have converged and minimize the Loss Function.
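To make this concrete, below is a minimal sketch in C (not from the original article) that runs Gradient Descent for a Simple Linear Regression Model; the data points, learning rate and epoch count are purely illustrative assumptions.

#include <stdio.h>

int main(void) {
    /* hypothetical data points (x, y), roughly following y = 2x + 1 */
    double x[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double y[] = {3.1, 4.9, 7.2, 9.0, 11.1};
    int n = 5, i, e;

    double theta0 = 0.0, theta1 = 0.0; /* initial parameter guesses */
    double eta = 0.01;                 /* assumed learning rate */
    int epochs = 10000;                /* assumed number of training epochs */

    for (e = 0; e < epochs; e++) {
        double grad0 = 0.0, grad1 = 0.0;
        /* gradient of the SSE with respect to theta0 and theta1 */
        for (i = 0; i < n; i++) {
            double error = y[i] - (theta0 + theta1 * x[i]);
            grad0 += -2.0 * error;
            grad1 += -2.0 * error * x[i];
        }
        /* small step in the opposite direction of the gradient */
        theta0 -= eta * grad0;
        theta1 -= eta * grad1;
    }

    printf("theta0 = %f, theta1 = %f\n", theta0, theta1);
    return 0;
}

With these illustrative settings the parameters settle near θ_0 ≈ 1.03 and θ_1 ≈ 2.01, which matches the line the hypothetical points were scattered around.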
Analytical Solution
However, there does exist an analytical solution to the Simple Linear Regression Model. Instead of using numerical techniques, we can simply set the partial derivatives to zero:
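\frac{\partial SSE}{\partial \theta_0} = -2 \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i) = 0

\frac{\partial SSE}{\partial \theta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \theta_0 - \theta_1 x_i) = 0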
This is a system of two linear equations with two unknowns that can be solved analytically.
Through mathematical derivation and rearranging, the values of the parameters that satisfy the
above equations are:
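\theta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \theta_0 = \bar{y} - \theta_1 \bar{x}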
Where x̅ and ȳ are the means of the feature variable and the target variable respectively. Therefore, by simply calculating these averages we can find the parameters that minimize the Loss Function without any iterative approach!
The above expressions are the analytical solution for the Simple Linear Regression Model. They are just a reduced version of the general solution for Linear Regression Models, where we could have more than two unknown parameters:
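\Theta = (X^T X)^{-1} X^T Y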
Where X is the matrix of feature data, Y is the vector of target values and Θ is the vector of parameters.
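As a quick companion illustration (again not from the original article), here is a minimal C sketch that evaluates the closed-form expressions directly, using the same hypothetical data as the Gradient Descent sketch above:

#include <stdio.h>

int main(void) {
    /* same hypothetical data points as the Gradient Descent sketch */
    double x[] = {1.0, 2.0, 3.0, 4.0, 5.0};
    double y[] = {3.1, 4.9, 7.2, 9.0, 11.1};
    int n = 5, i;

    /* means of the feature and target variables */
    double x_mean = 0.0, y_mean = 0.0;
    for (i = 0; i < n; i++) {
        x_mean += x[i];
        y_mean += y[i];
    }
    x_mean /= n;
    y_mean /= n;

    /* closed-form solution: theta1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2) */
    double num = 0.0, den = 0.0;
    for (i = 0; i < n; i++) {
        num += (x[i] - x_mean) * (y[i] - y_mean);
        den += (x[i] - x_mean) * (x[i] - x_mean);
    }
    double theta1 = num / den;
    double theta0 = y_mean - theta1 * x_mean;

    printf("theta0 = %f, theta1 = %f\n", theta0, theta1);
    return 0;
}

It prints θ_0 ≈ 1.03 and θ_1 ≈ 2.01 in a single pass over the data, the same values Gradient Descent converges towards, with no iterations or learning rate required.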
So why do we use Gradient Descent when an analytical solution exists? The answer comes down to computational time and space costs.
The cost of each Gradient Descent pass over the data scales roughly as O(kn), where k is the number of features and n is the total number of data points, and in practice it is made even faster through vectorized implementations. This is how most Machine Learning algorithms are implemented today.
However, the general analytical solution for Linear Regression requires forming and inverting the matrix XᵀX, which costs roughly O(nk² + k³). Therefore, for small datasets the difference is negligible, but the computational gap grows rapidly as the dataset size and, in particular, the number of features increase. Many datasets in practice have on the order of 100 features and a million rows, and at that scale the analytical solution quickly becomes impractical.
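As a rough worked illustration of these estimates: with k = 100 features and n = 1,000,000 rows, forming XᵀX alone takes on the order of nk² ≈ 10¹⁰ multiplications, whereas a single Gradient Descent epoch takes on the order of nk ≈ 10⁸, roughly a hundred times fewer, and that factor widens with every extra feature.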
Furthermore, for some models, such as Poisson Regression and Logistic Regression, setting the derivatives to zero leads to a set of non-linear equations with no closed-form analytical solution. Thus, we are forced to use numerical methods such as Gradient Descent.
Conclusion
Gradient Descent is preferred over analytical solutions because of its computational efficiency and because some Regression models have no closed-form solution at all, which makes iterative numerical methods a necessity.
I hope you guys enjoyed this article and have learned something new! There are plenty of other articles taking a deeper dive into some of the derivations I condensed in this post, so I would recommend checking them out!
LinkedIn
Twitter
GitHub
Kaggle
(All emojis designed by OpenMoji — the open-source emoji and icon project. License: CC BY-SA
4.0)
Something Extra!
Shown below is some sample code I wrote in C to showcase how Gradient Descent can be programmed!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

double dydx(double x);

int main() {
    int epochs, i;
    double learning_rate, x, x_new;
    printf("Enter your initial guess: ");
    scanf("%lf", &x);
    printf("Enter how many epochs: ");
    scanf("%d", &epochs);
    /* Assumed completion from here on: the original listing is truncated at this point */
    printf("Enter the learning rate: ");
    scanf("%lf", &learning_rate);
    /* Gradient Descent: step opposite to the derivative each epoch */
    for (i = 0; i < epochs; i++) {
        x_new = x - learning_rate * dydx(x);
        x = x_new;
    }
    printf("Minimum found at x = %f\n", x);
    return 0;
}

/* Assumed example function: derivative of f(x) = x^2 */
double dydx(double x) { return 2.0 * x; }