Module 4 Lab 2
Gradient Descent is an optimization algorithm used to find the best parameters (like weights in a
model) that minimize a loss function (a measure of how wrong your model is).
How it works:
1. Start with random guesses for your parameters.
2. Calculate how “bad” your guess is using a loss function (like Mean Squared Error).
3. Compute the gradient (slope) of the loss with respect to each parameter—this tells you
which way to move to reduce the loss.
4. Update each parameter a little bit in the direction that reduces the loss.
5. Repeat until the loss stops getting smaller.
Update formula:
θ_new = θ_old − α · gradient
θ = a parameter (like a, b, or c)
α = learning rate (step size)
gradient = partial derivative of the loss with respect to θ
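To see the update rule in action before touching the lab's quadratic model, here is a minimal sketch on a made-up one-parameter loss, L(θ) = (θ − 3)², whose minimum is at θ = 3. The starting guess of 0.0, the learning rate of 0.1, and the 50 steps are arbitrary choices for illustration.

theta = 0.0                        # start with an arbitrary guess
alpha = 0.1                        # learning rate (step size)

def toy_loss(theta):
    return (theta - 3) ** 2        # how "bad" the current guess is

def toy_gradient(theta):
    return 2 * (theta - 3)         # slope of the toy loss at theta

for step in range(50):
    theta = theta - alpha * toy_gradient(theta)   # step downhill a little

print(theta)                       # ends up very close to 3, the minimum

Each pass applies exactly the formula above: subtract the learning rate times the gradient, so the guess slides toward the value that minimizes the loss.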
🔍 Section 2 — Importing Required Libraries
import numpy as np
import matplotlib.pyplot as plt
import random
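# Fix the random seeds so results are reproducible from run to run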
random.seed(42)
np.random.seed(42)
We want to find the best coefficients a, b, and c for the quadratic equation y = a·x² + b·x + c that fits our noisy data.
MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
yᵢ = actual value
ŷᵢ = predicted value from our current guess
Lower MSE = better fit.
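To make this concrete, the sketch below builds a noisy quadratic dataset and small helpers for prediction and MSE. The "true" coefficients (2, -1, 0.5), the noise level, and the number of points are made-up values for illustration, not the lab's prescribed ones; the numpy import and seed come from Section 2.

# Noisy data from a hidden quadratic (illustrative values, not the lab's)
x = np.linspace(-3, 3, 100)
y = 2 * x**2 - 1 * x + 0.5 + np.random.normal(0, 2, size=x.shape)

def predict(a, b, c, x):
    # Prediction from the current guess: y_hat = a*x^2 + b*x + c
    return a * x**2 + b * x + c

def mse(y, y_hat):
    # Mean Squared Error: average squared gap between actual and predicted values
    return np.mean((y - y_hat) ** 2)

print(mse(y, predict(0.0, 0.0, 0.0, x)))   # large loss for a poor first guess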
🔍 Section 6 — How Does Gradient Descent Improve Our Guess?
B. Make Predictions
For each x, compute ŷ = a·x² + b·x + c using the current guesses for a, b, and c.
Example (with made-up numbers):
If a = 1, b = 2, c = 0 and x = 3, then ŷ = 1·(3²) + 2·3 + 0 = 15.
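With numpy, the "for each x" step is a single vectorized line. This reuses the x array and predict() helper from the sketch above; the coefficient values are arbitrary.

a, b, c = 1.0, 2.0, 0.0
y_hat = predict(a, b, c, x)    # ŷ for every data point at once, no explicit loop
print(y_hat[:5])               # first few predictions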
E. Repeat
Keep repeating the steps (predict, compute loss, compute gradients, update) for many
iterations (epochs).
The loss should get smaller each time.
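Putting the whole loop together, here is one possible full-batch version for the quadratic fit, reusing x, y, predict, and mse from the earlier sketch. The zero starting guesses, the learning rate of 0.01, and the 2000 epochs are illustrative choices, not prescribed values.

a, b, c = 0.0, 0.0, 0.0            # start with (arbitrary) guesses
alpha = 0.01                       # learning rate
loss_history = []                  # record the loss so we can plot it later

for epoch in range(2000):
    y_hat = predict(a, b, c, x)            # predict
    loss_history.append(mse(y, y_hat))     # compute loss
    error = y_hat - y
    grad_a = 2 * np.mean(error * x**2)     # gradients of MSE w.r.t. a, b, c
    grad_b = 2 * np.mean(error * x)
    grad_c = 2 * np.mean(error)
    a -= alpha * grad_a                    # update each parameter a little
    b -= alpha * grad_b
    c -= alpha * grad_c

print(a, b, c, loss_history[-1])   # the coefficients drift toward the data's true ones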
Partial derivatives tell us how much the loss will change if we change just one parameter.
They point in the direction of steepest increase; moving in the opposite direction reduces
the loss.
This is how gradient descent “knows” which way to step for each parameter.
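For this lab's model (ŷ = a·x² + b·x + c) and MSE loss, the three partial derivatives work out to ∂L/∂a = (2/n)·Σ(ŷᵢ − yᵢ)·xᵢ², ∂L/∂b = (2/n)·Σ(ŷᵢ − yᵢ)·xᵢ, and ∂L/∂c = (2/n)·Σ(ŷᵢ − yᵢ). As a quick sanity check, the sketch below compares the analytic ∂L/∂a with a finite-difference estimate; it reuses x, y, predict, and mse from the earlier sketches, and the test point and step size eps are arbitrary.

a, b, c = 1.0, 0.0, 0.0                        # any test point works
error = predict(a, b, c, x) - y
analytic = 2 * np.mean(error * x**2)           # analytic dLoss/da
eps = 1e-6                                     # small step for the numerical estimate
numeric = (mse(y, predict(a + eps, b, c, x)) - mse(y, predict(a - eps, b, c, x))) / (2 * eps)
print(analytic, numeric)                       # the two estimates should agree closely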
- Full Batch: uses all the data to compute the gradient in each update. No need to shuffle the data; order doesn't matter.
- Mini-Batch: uses small, randomly selected subsets (mini-batches) to compute the gradient and update the parameters. Shuffling is important to avoid biased batches (e.g., all one class in a batch); a small sketch follows below.
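For contrast, here is a minimal mini-batch version of the same loop, reusing x, y, and predict from the earlier sketches. The batch size of 16, the learning rate, and the epoch count are arbitrary choices; reshuffling the indices each epoch is what keeps every batch an unbiased random sample.

a, b, c = 0.0, 0.0, 0.0
alpha = 0.01
batch_size = 16

for epoch in range(500):
    order = np.random.permutation(len(x))           # shuffle so batches aren't biased
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]                      # one small random batch
        error = predict(a, b, c, xb) - yb
        a -= alpha * 2 * np.mean(error * xb**2)      # update from this batch only
        b -= alpha * 2 * np.mean(error * xb)
        c -= alpha * 2 * np.mean(error)

print(a, b, c)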
- Gradient Descent uses partial derivatives to update each parameter in the direction that reduces the loss.
- Learning rate controls the size of each step.
- Loss function (like MSE) measures how well the model fits the data.
- Batch size (full vs. mini) affects how updates are computed and whether shuffling is needed.
- Visualization of the loss helps you see if training is working (sketched below).
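One way to do that visualization, assuming the loss_history list recorded in the full-batch loop above (matplotlib was imported in Section 2):

plt.plot(loss_history)             # loss per epoch
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.title("Loss during gradient descent")
plt.show()                         # a steadily falling curve means training is working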
Simple Analogy
Gradient descent is like finding the bottom of a valley (minimum loss) by feeling the slope (partial
derivatives) and always stepping downhill (negative gradient), adjusting your direction for each
parameter (a, b, c) separately.