Stochastic Gradient Descent
• For large datasets, computing the gradient using all data points at every
step can be slow and memory-intensive. This is where SGD comes into play.
• Instead of using the full dataset to compute the gradient at each step, SGD
uses only one randomly chosen data point (or a small batch of data points) per
iteration. This makes each update much cheaper to compute, at the cost of a
noisier gradient estimate (see the sketch after this list).
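A minimal sketch of this idea on a simple least-squares problem (the data, learning rate, and variable names here are illustrative assumptions, not part of the original slides):

```python
# Minimal SGD sketch on a least-squares objective (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # 1000 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)                          # parameters to learn
lr = 0.01                                # learning rate
batch_size = 1                           # 1 = pure SGD; >1 = mini-batch SGD

for step in range(5000):
    # Pick one random data point (or a small batch) instead of the full dataset.
    idx = rng.integers(0, len(X), size=batch_size)
    X_b, y_b = X[idx], y[idx]
    # Gradient of the squared error computed on this batch only.
    grad = 2 * X_b.T @ (X_b @ w - y_b) / batch_size
    w -= lr * grad                       # update from a noisy gradient estimate

print("estimated weights:", w)
```

Each iteration touches only `batch_size` rows of `X`, which is what makes SGD cheap per step on large datasets.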
[Figure: path followed by batch gradient descent vs. path followed by SGD]
The key difference from batch gradient descent is that, in SGD, each
parameter update is based on a single data point (or a small mini-batch), not
the entire dataset.
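Written out (using θ for the parameters, η for the learning rate, and L_i for the loss on the i-th of N training examples; this notation is ours, not from the original slides), the two update rules are:

```latex
% Batch gradient descent: average the gradient over all N examples per update.
\theta \leftarrow \theta - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta)

% Stochastic gradient descent: use one randomly sampled example i per update.
\theta \leftarrow \theta - \eta \cdot \nabla_\theta L_i(\theta)
```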
Advantages of Stochastic Gradient Descent:
• Efficiency: each update uses only one example (or a small batch), so iterations stay fast even on very large datasets.
• Memory Efficiency: only the current sample or mini-batch needs to be held in memory, not the full dataset.
• Escaping Local Minima: the noise in the gradient estimates can help the optimizer move out of shallow local minima and saddle points.
• Online Learning: the model can be updated on the fly as new data arrives, one example at a time.
Applications of Stochastic Gradient Descent
SGD and its variants are widely used across various domains of machine
learning:
• Deep Learning
• Natural Language Processing (NLP)
• Computer Vision
• Reinforcement Learning