3 Types of Gradient Descent Algorithms For Small & Large Datasets
Introduction
Gradient Descent (GD) is an iterative algorithm for finding a minimum of an objective function (cost function) J(θ); for convex cost functions such as the one used in linear regression, this is the global minimum. It is widely used in machine learning to minimize cost functions. GD comes in several variants that trade off accuracy against computation time, and these variants are discussed below in detail.
We will use linear regression as the running example in this article while talking about gradient descent (a minimal sketch of its cost function follows the list below), although the ideas apply to other algorithms too, such as:
Logistic regression
Neural networks
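To make the running example concrete, here is a minimal sketch of the linear-regression cost function J(θ) that the variants below minimize (the function name and the variable names t0 and t1 are illustrative choices, not from the article):

import numpy as np

def cost(t0, t1, x, y):
    # J(theta): half the mean squared error of the hypothesis h(x) = t0 + t1*x
    m = len(y)
    h = t0 + t1 * x                # predictions of the linear model
    return np.sum((h - y) ** 2) / (2 * m)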
Batch Gradient Descent
In batch gradient descent we need to calculate the gradient on the whole dataset to perform just one update, so it can be very slow and is intractable for datasets that don't fit in memory. After initializing the parameters with arbitrary values, we calculate the gradient of the cost function using the following relation:

repeat until convergence:
    θ_j := θ_j − (α/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i)

where h_θ(x) = θ_0 + θ_1·x is the hypothesis, α is the learning rate, and m is the number of training examples.
If you have 300,000,000 records, you need to read all of them from disk on every iteration because you can't store them all in memory.
After computing the sum over all the records for one iteration, we take one step.
Then we repeat this for every step.
This means it takes a long time to converge.
Especially because disk I/O is typically a system bottleneck anyway, and this will inevitably require a huge number of reads.
Batch gradient descent is therefore not suitable for huge datasets. The code below shows how to implement batch gradient descent in Python.
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, max_iter=1000, tol=1e-6):
    m = len(y)                      # number of training examples
    # initial theta
    t0 = np.random.random()
    t1 = np.random.random()
    # initial error (cost)
    J = sum((t0 + t1 * x[i] - y[i]) ** 2 for i in range(m)) / (2 * m)
    converged = False
    iter = 0
    # Iterate Loop
    while not converged:
        # over all training samples, compute the gradient (d/d_theta J(theta))
        grad0 = 1.0 / m * sum((t0 + t1 * x[i] - y[i]) for i in range(m))
        grad1 = 1.0 / m * sum((t0 + t1 * x[i] - y[i]) * x[i] for i in range(m))
        # simultaneous update of theta
        temp0 = t0 - alpha * grad0
        temp1 = t1 - alpha * grad1
        t0 = temp0
        t1 = temp1
        # update error and check convergence
        e = sum((t0 + t1 * x[i] - y[i]) ** 2 for i in range(m)) / (2 * m)
        if abs(J - e) <= tol:
            converged = True
        J = e            # update error
        iter += 1        # update iteration counter
        if iter == max_iter:
            print('Max iterations exceeded!')
            converged = True
    return t0, t1
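As a quick usage sketch (the synthetic dataset below, with a true intercept of 2 and slope of 3, is an illustrative assumption, not part of the original article):

import numpy as np

np.random.seed(0)
x = np.linspace(0, 10, 100)                     # 100 synthetic inputs
y = 2.0 + 3.0 * x + 0.5 * np.random.randn(100)  # y = 2 + 3x plus noise

t0, t1 = batch_gradient_descent(x, y, alpha=0.01, max_iter=5000)
print('intercept:', t0, 'slope:', t1)           # should land roughly near 2 and 3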
Stochastic Gradient Descent
The first step of the algorithm is to shuffle the whole training set. Then, to update each parameter, we use only one training example per iteration to compute the gradient of the cost function. Because it uses a single training example per iteration, this algorithm is faster on large datasets. With SGD one may not reach the same accuracy as batch gradient descent, but the results are computed much faster.
After initializing the parameters with arbitrary values, we calculate the gradient of the cost function using the following relation:

for i = 1 to m:
    θ_j := θ_j − α · (h_θ(x^(i)) − y^(i)) · x_j^(i)

where m is the number of training examples.
SGD never actually converges the way batch gradient descent does, but ends up wandering around some region close to the global minimum.
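A minimal sketch of stochastic gradient descent for the same linear-regression example (the function name, the epoch count, and the shuffling via np.random.permutation are my own illustrative choices):

import numpy as np

def stochastic_gradient_descent(x, y, alpha=0.01, epochs=50):
    m = len(y)
    t0, t1 = np.random.random(), np.random.random()
    for _ in range(epochs):
        # first step: randomize (shuffle) the order of the training set
        for i in np.random.permutation(m):
            # update theta using a single training example
            error = t0 + t1 * x[i] - y[i]
            t0 = t0 - alpha * error
            t1 = t1 - alpha * error * x[i]
    return t0, t1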
Mini-Batch Gradient Descent
Mini-batch gradient descent takes the best of both worlds and performs an update for every mini-batch of training examples. This way, it:
reduces the variance of the parameter updates, which can lead to more stable convergence.
can make use of highly optimized matrix operations, which makes computing the gradient very efficient.
After initializing the parameters with arbitrary values, we calculate the gradient of the cost function using the following relation:

repeat:
    for k = 1, 1 + b, 1 + 2b, ..., up to m:
        θ_j := θ_j − (α/b) · Σ_{i=k..k+b−1} (h_θ(x^(i)) − y^(i)) · x_j^(i)

where b is the size of each mini-batch and m is the total number of training examples.
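A sketch of the mini-batch variant under the same assumptions (the batch size of 32 and the function name are illustrative choices; x and y are assumed to be NumPy arrays):

import numpy as np

def mini_batch_gradient_descent(x, y, alpha=0.01, b=32, epochs=50):
    m = len(y)
    t0, t1 = np.random.random(), np.random.random()
    for _ in range(epochs):
        idx = np.random.permutation(m)       # shuffle the training set
        for start in range(0, m, b):
            batch = idx[start:start + b]     # indices of one mini-batch
            error = t0 + t1 * x[batch] - y[batch]
            # average the gradient over the examples in this mini-batch
            t0 = t0 - alpha * np.mean(error)
            t1 = t1 - alpha * np.mean(error * x[batch])
    return t0, t1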
Learning Rate
The learning rate α is a crucial parameter that controls how large a step our algorithm takes; a small sketch illustrating its effect follows this list.
1. If α is too large, the algorithm takes large steps, may overshoot the minimum, and may not converge.
2. If α is small, the steps are smaller and convergence is more reliable, but it can take many iterations to converge.
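As a purely illustrative sketch (the quadratic objective J(θ) = θ² and the two α values are assumptions, not from the article), the snippet below runs a few gradient steps with a small and a too-large learning rate:

def run_gd(alpha, steps=10, theta=5.0):
    # plain gradient descent on J(theta) = theta**2, whose gradient is 2*theta
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run_gd(alpha=0.1))   # small alpha: theta shrinks toward the minimum at 0
print(run_gd(alpha=1.1))   # too-large alpha: theta oscillates and blows up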
Checking whether gradient descent is working
Plot the value of the cost function against the number of iterations completed. This plot helps to identify whether gradient descent is working properly or not.
"J(θ) should decrease after every iteration and should become constant (or converge) after some iterations."
The above statement holds because after every iteration of gradient descent, θ_0 and θ_1 take values that move J(θ) towards its minimum, i.e. the value of J(θ) decreases after every iteration.
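A minimal sketch of such a diagnostic plot (it reuses the simple quadratic J(θ) = θ² from the learning-rate sketch above purely for illustration; in practice you would record J(θ) inside your own training loop):

import matplotlib.pyplot as plt

# record J(theta) = theta**2 at every iteration of a simple gradient descent run
theta, alpha = 5.0, 0.1
cost_history = []
for _ in range(50):
    cost_history.append(theta ** 2)    # cost at the current iteration
    theta = theta - alpha * 2 * theta  # gradient step: dJ/dtheta = 2*theta

plt.plot(range(len(cost_history)), cost_history)
plt.xlabel('Number of iterations')
plt.ylabel('Cost J(theta)')
plt.title('J(theta) should decrease and then flatten out')
plt.show()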
Source: http://blog.hackerearth.com/3-types-gradient-descent-algorithms-small-large-data-sets?utm_source=facebook-post&utm_campaign=Blog-gradient-descent-algorithms&utm_medium=he-handle