GradientDescent-Regression_slides
To find the least squares solution, we need to find the point at which the gradient of the objective function J(θ) with respect to θ is zero.
Least Squares Linear Regression
● Setting the gradient to 0 and solving for θ gives XᵀX θ = Xᵀy; this is the normal equation for least squares linear regression.
● The closed-form solution for linear regression is therefore θ = (XᵀX)⁻¹ Xᵀy.
● It is important to note that the closed-form solution exists only when the matrix XᵀX is invertible. It is also only suitable for smaller datasets, where matrix inversion is computationally feasible.
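As a minimal sketch of how this closed-form solution might be computed in practice (the data below is made up for illustration; np.linalg.lstsq is shown as the usual fallback when XᵀX is not invertible):

import numpy as np

# Made-up illustrative data: 4 examples, an intercept column plus 2 features.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.0],
              [1.0, 3.0, 1.0],
              [1.0, 4.0, 3.0]])
y = np.array([3.0, 4.0, 6.0, 9.0])

# Normal equation: (X^T X) theta = X^T y.
# Solving the linear system is preferred over forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)

# If X^T X is singular (or nearly so), a least squares solver is the usual fallback.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)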
Least Squares Linear Regression
● In least squares linear regression, a closed-form solution allows us to calculate the regression coefficients (e.g., β₀ and β₁) in one step using matrix algebra, specifically by solving the normal equation.
Example
Let’s assume we have the following data points: x=[1,2,3,4,5], y=[2,3,5,4,6].
Compute β0 and β1
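A quick worked sketch of this example with NumPy, using the simple linear regression formulas β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β₀ = ȳ − β₁x̄:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

# Slope and intercept from the least squares formulas.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)  # beta0 ≈ 1.3, beta1 = 0.9, so the fitted line is y ≈ 1.3 + 0.9x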
Polynomial Regression
● It is an extension of linear regression where the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. It allows for more complex, non-linear relationships between the input and output variables by fitting a polynomial curve to the data.
● In polynomial regression, the model takes the form: y = β₀ + β₁x + β₂x² + ⋯ + βₙxⁿ + ϵ
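As a small illustration (the data and the degree below are assumptions chosen for the sketch), NumPy's polyfit performs exactly this kind of least squares polynomial fit:

import numpy as np

# Made-up data with a roughly quadratic trend, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 4.2, 8.8, 16.1, 26.0])

# Fit a degree-2 polynomial; coefficients are returned highest degree first.
coeffs = np.polyfit(x, y, deg=2)
y_hat = np.polyval(coeffs, x)  # fitted values on the training inputs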
Polynomial Regression
● Polynomial regression can be solved using matrix algebra. Given a design matrix X (which contains the original and polynomial-transformed features), the parameters θ can be estimated by the normal equation θ = (XᵀX)⁻¹ Xᵀy.
● Here X is the design (Vandermonde) matrix whose i-th row is [1, xᵢ, xᵢ², …, xᵢⁿ].
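A minimal sketch of this approach, reusing the example data from the earlier slide and an assumed degree of 3:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
degree = 3

# Design matrix with columns [1, x, x^2, x^3] (increasing powers).
X = np.vander(x, N=degree + 1, increasing=True)

# Normal equation theta = (X^T X)^{-1} X^T y, solved as a linear system.
theta = np.linalg.solve(X.T @ X, X.T @ y)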
Learning Parameters: Gradient Descent
● Gradient descent (GD) is a mechanism in supervised learning to learn the parameters of a neural network by navigating the error surface in an efficient and principled way. It is used to find the function parameters (coefficients) that minimize a loss function.
● Error surfaces are graphical representations of the relationship between the model’s parameters and the corresponding error values.
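As an illustration of the idea (the error surface, starting point, and learning rate below are assumptions chosen for the sketch):

import numpy as np

def loss(theta):
    # A simple bowl-shaped error surface with its minimum at (3, -1).
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad(theta):
    # Gradient of the loss above.
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

theta = np.zeros(2)  # initial parameters
eta = 0.1            # learning rate (assumed)
for _ in range(100):
    theta = theta - eta * grad(theta)  # step opposite to the gradient

# theta converges to approximately (3.0, -1.0), the minimum of this surface.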
● cos(β) = −1 when the angle β is 180°; hence the loss decreases fastest when the update step points in the direction opposite to the gradient.
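A sketch of the standard argument behind this, assuming a small step Δθ = k·u with u a unit vector at angle β to the gradient:

\[
\Delta \mathcal{L} \;\approx\; \nabla_\theta \mathcal{L}^{\top} \Delta\theta
\;=\; k \,\lVert \nabla_\theta \mathcal{L} \rVert \,\lVert u \rVert \cos\beta ,
\]
\[
\text{which is most negative when } \cos\beta = -1 \ (\beta = 180^{\circ}),
\text{ i.e. for the update } \Delta\theta = -\eta\, \nabla_\theta \mathcal{L}.
\]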
Learning Parameters: Gradient Descent
Gradient Descent: Example
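As one possible worked illustration (the function f(w) = w², starting point w₀ = 4, and learning rate η = 0.25 are assumptions), the update w ← w − η f′(w) gives:

\[
w_1 = 4 - 0.25 \cdot 8 = 2, \qquad
w_2 = 2 - 0.25 \cdot 4 = 1, \qquad
w_3 = 1 - 0.25 \cdot 2 = 0.5,
\]

so the iterates move steadily toward the minimizer w* = 0.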
Batch Gradient Descent
● Batch gradient descent: gradient descent uses all n training examples for weight updates.
In this approach, the algorithm calculates the gradient of the cost function using the
entire dataset before updating the weights (or parameters).
Single Weight Update per Epoch: After calculating the average gradient using all
training examples, the algorithm updates the model's weights once. This process is
repeated for multiple iterations until the cost function converges to a minimum.
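A minimal batch gradient descent sketch for a simple linear model y ≈ θ0 + θ1·x (the data, learning rate, and the 1/2m scaling of the squared error are assumptions):

import numpy as np

# Made-up 1-D data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

theta0, theta1 = 0.0, 0.0
eta = 0.1        # learning rate (assumed)
m = len(x)

for epoch in range(2000):
    err = (theta0 + theta1 * x) - y
    # Average the gradient over ALL m training examples (batch GD).
    grad0 = err.mean()
    grad1 = (err * x).mean()
    # One weight update per epoch.
    theta0 -= eta * grad0
    theta1 -= eta * grad1

# theta0, theta1 converge to roughly 1.3 and 0.9, the least squares fit for this data.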
Batch Gradient Descent
● Consider the example of linear regression hθ(x) = θ0 + θ1x, and let the cost function be the mean squared error (with m denoting the number of training examples).
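Assuming the common 1/2m scaling of the squared error, the cost and the resulting batch update rules work out to:

\[
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2
\]
\[
\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr),
\qquad
\theta_1 := \theta_1 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x^{(i)}
\]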
Stochastic Gradient Descent
● In the above cost function, take m = 1, so each weight update happens for each individual data point.
● This makes the algorithm faster and more memory-efficient, especially for large datasets, but it can introduce some noise into the updates.
Algorithm:
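A minimal per-example (stochastic) sketch of the same linear regression update; the data, learning rate, and epoch count are assumptions:

import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

theta0, theta1 = 0.0, 0.0
eta = 0.01   # learning rate (assumed)

for epoch in range(1000):
    # Visit the training examples in a random order each epoch.
    for i in rng.permutation(len(x)):
        err = (theta0 + theta1 * x[i]) - y[i]
        # m = 1: update the weights using a single example.
        theta0 -= eta * err
        theta1 -= eta * err * x[i]

# theta0, theta1 end up near the least squares fit (roughly 1.3 and 0.9),
# jittering slightly because each update uses a single noisy gradient.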
Mini-Batch Gradient Descent
● The algorithm updates the parameters after it sees a mini-batch-size number of data points.
● Averaging the gradients over the mini-batch gives a better sense of the gradient direction; the estimate becomes more consistent as the number of samples in the batch grows.
● The mini-batch version of SGD is the default option for training neural networks.
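A minimal mini-batch sketch for the same setup (batch size, data, learning rate, and epoch count are assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

theta0, theta1 = 0.0, 0.0
eta = 0.02        # learning rate (assumed)
batch_size = 2    # mini-batch size (assumed)

for epoch in range(2000):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        err = (theta0 + theta1 * x[idx]) - y[idx]
        # Average the gradient over the mini-batch before updating.
        theta0 -= eta * err.mean()
        theta1 -= eta * (err * x[idx]).mean()

# theta0, theta1 settle near the least squares fit (roughly 1.3 and 0.9); averaging
# over the mini-batch gives a less noisy gradient estimate than a single example.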