SGD

Stochastic gradient descent

Uploaded by Sam Smith

Explain what gradient descent means.

a. Gradient descent aims to find the minimum of a function, typically a loss
function J(θ), where θ represents the parameters of the model.

b. At each iteration, gradient descent calculates the gradient (the vector of
partial derivatives) of the loss function with respect to the parameters θ.
Each component of the gradient indicates how much J changes with respect to a
small change in the corresponding parameter, holding all other parameters
constant.

c. The parameters θ are updated iteratively in the opposite direction of the
gradient to minimize the loss function: θ ← θ − α∇J(θ). The learning rate α
controls the step size of each update.

d. Iterative process: steps b and c are repeated until a stopping criterion is
met (e.g., a maximum number of iterations or convergence of the loss function).
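Steps b–d can be sketched as a short loop. This is a minimal illustration, not a production optimizer; the quadratic example function and all names below are assumptions made for the sketch:

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, max_iters=1000, tol=1e-8):
    """Minimize a function given its gradient, following steps b-d above."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):           # step d: iterate until a criterion is met
        g = grad(theta)                  # step b: compute the gradient at theta
        theta = theta - alpha * g        # step c: move opposite the gradient
        if np.linalg.norm(g) < tol:      # stopping criterion: gradient near zero
            break
    return theta

# Illustrative loss J(theta) = (theta - 3)^2, with gradient 2*(theta - 3);
# the minimum is at theta = 3.
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0])
```

With α = 0.1 the error shrinks by a constant factor each step, so the loop converges to θ ≈ 3 well before the iteration cap.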

G.R.A.D.I.E.N.T.

Get initial parameters
Recognize gradient direction
Adjust parameters
Determine convergence criteria
Iterate until convergence
Evaluate performance
Navigate toward minimum
Tune learning rate

4b. Stochastic Gradient Descent (SGD) and conventional Gradient Descent (GD) are
both optimization algorithms used in training machine learning models, but they differ
in their approach to updating model parameters. Here’s when each one is typically
used:

Gradient Descent (GD):

Batch Processing: GD computes the gradient of the loss function with respect to the
entire training dataset. It then updates the model parameters once per epoch (one
pass through the entire dataset).
Suitable for Small Datasets: GD is suitable when the dataset fits entirely in memory
and is not too large. It ensures that each parameter update is based on the precise
average gradient computed across the entire dataset.
Advantages: Typically converges to the global minimum (for convex problems) or a
good local minimum (for non-convex problems) more reliably because it uses precise
gradient information.
Stochastic Gradient Descent (SGD):

Online Learning: SGD updates the model parameters incrementally for each training
example (or mini-batch of examples). It computes the gradient using only one
example (or a small batch) at a time.
Large Datasets: SGD is particularly useful for large datasets that cannot fit into
memory. It allows for iterative updates without needing to load the entire dataset into
memory at once.
Advantages: Faster progress per unit of computation compared to GD, because
updates are more frequent and each one uses fewer computational resources.
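The per-example update described above can be sketched on a tiny least-squares problem. The function name and the synthetic data are assumptions for illustration only; real training loops add features (learning-rate schedules, stopping criteria) omitted here:

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.05, epochs=500, seed=0):
    """Fit y = X @ w with one parameter update per training example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):   # visit examples in shuffled order
            err = X[i] @ w - y[i]           # residual for a single example
            w -= alpha * err * X[i]         # gradient of 0.5*err^2 w.r.t. w
    return w

# Noiseless synthetic data generated by the true weights [2, -1].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
w = sgd_linear_regression(X, y)
```

Note that each update touches only one row of `X`, which is why this style scales to datasets too large to hold in memory: rows can be streamed from disk.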
When to Use SGD vs. GD:

GD: Use GD when you have a small to moderate-sized dataset that can fit into
memory and when you want to ensure precise updates based on the entire dataset.
It's also suitable for situations where you want a smoother convergence trajectory
towards the minimum.

SGD: Use SGD when dealing with large datasets or when implementing online
learning where you need to continuously update the model as new data arrives. It's
also beneficial in scenarios where computational resources are limited or when you
want faster updates per iteration.

Variants:

Mini-Batch SGD: A compromise between GD and SGD, where updates are made
based on small batches of data. It combines the benefits of both approaches by
reducing the variance of parameter updates compared to pure SGD while still being
computationally efficient.
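A minimal mini-batch variant of the same least-squares sketch, assuming the same synthetic data and illustrative names as before; the only change from pure SGD is that each update averages the gradient over a small batch, which reduces its variance:

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=2, alpha=0.05, epochs=500, seed=0):
    """Fit y = X @ w, updating once per mini-batch of examples."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]               # residuals for the batch
            w -= alpha * X[idx].T @ err / len(idx)  # averaged batch gradient
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
w = minibatch_sgd(X, y)
```

Setting `batch_size=1` recovers pure SGD, and `batch_size=len(X)` recovers full-batch GD, which is why mini-batch SGD is described as a compromise between the two.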
In practice, the choice between GD and SGD (or its variants) depends on the specific
problem, the size of the dataset, computational resources available, and desired
convergence properties of the optimization process.

DOWNSIDES to SGD:
Due to its stochastic nature and noisy updates, SGD does not always converge to
the global minimum of the loss function. Instead, it often settles into (or
oscillates around) a good local minimum, which may or may not be optimal.

