Gradient Descent and SGD
Keywords: optimization | gradient descent | sgd | minibatch sgd | linear regression
Contents
Example: Linear regression
Animating SGD
Variations on a theme
Momentum
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from scipy import stats
A great discussion (and where the momentum images were stolen from) is at http://sebastianruder.com/optimizing-gradient-descent/
Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. Gradient descent is a way to minimize an objective function $J(\theta)$ parameterized by a model's parameters $\theta \in \mathbb{R}^d$ by updating the parameters in the opposite direction of the gradient of the objective function $\nabla_\theta J(\theta)$ w.r.t. the parameters. The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
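To make the update rule concrete, here is a tiny sketch (not from the notebook) that minimizes $J(\theta) = \theta^2$, whose gradient is $2\theta$; the starting point and learning rate are arbitrary choices:

theta = 5.0                     # arbitrary starting point
eta = 0.1                       # learning rate
for _ in range(50):
    grad = 2 * theta            # gradient of J(theta) = theta**2
    theta = theta - eta * grad  # step opposite the gradient
print(theta)                    # approaches the minimum at theta = 0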
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.
Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform
an update.
Example: Linear regression

We will use linear least squares regression as a running example, with the model

$$f_\theta(x) = \theta^T x$$
The cost function for our linear least squares regression will then be

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m} \left(f_\theta(x^{(i)}) - y^{(i)}\right)^2$$
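The cell that generated the example data was lost in extraction; here is a minimal stand-in that feeds the x = x.flatten() step below. The seed, sample size, true intercept and slope, and noise scale are all assumptions, chosen only to match the plot limits used later:

np.random.seed(42)                                   # assumed seed, for reproducibility
m = 100                                              # assumed sample size
x = np.random.uniform(-3, 2.5, size=(m, 1))          # inputs spanning the plot's x-limits
y = -15 + 40 * x.ravel() + stats.norm(0, 10).rvs(m)  # hypothetical true line plus Gaussian noise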
x = x.flatten()
plt.plot(x, y, 'o');   # scatter of the raw data (the plot call was lost; only its output survived)
Batch gradient descent
Assume that we have a vector of parameters $\theta$ and a cost function $J(\theta)$, which is simply the function we want to minimize (our objective function). Typically, we will find that the objective function has the form:

$$J(\theta) = \sum_{i=1}^{m} J_i(\theta)$$

where $J_i$ is associated with the $i$-th observation in our data set. The batch gradient descent algorithm starts with some initial feasible $\theta$ (which we can either fix or assign randomly) and then repeatedly performs the update:

$$\theta := \theta - \eta \nabla_\theta J(\theta) = \theta - \eta \sum_{i=1}^{m} \nabla_\theta J_i(\theta)$$

where $\eta$ is a constant controlling step-size and is called the learning rate. Note that in order to make a single update, we need to calculate the gradient using the entire dataset. This can be very inefficient for large datasets.
for i in range(n_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
For a given number of epochs n_epochs, we first evaluate the gradient vector of the loss function using ALL examples in the data set, and then we update the parameters with a given learning rate. This is where Theano and automatic differentiation come in handy, and you will learn about them in lab.
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-
convex surfaces.
In the linear example it's easy to see that our update step then takes the form:

$$\theta := \theta - \eta \sum_{i=1}^{m} \left(f_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$
def gradient_descent(x, y, theta, step=0.001, maxsteps=0, precision=0.001):
    # reconstructed sketch: the top of this function was lost; defaults are assumptions
    history, costs, preds, counter = [], [np.inf], [], 0
    while True:
        pred = x.dot(theta)                              # predictions on the full data set
        error = pred - y
        costs.append(np.sum(error ** 2) / (2 * y.size))  # least-squares cost
        theta = theta - step * x.T.dot(error)            # batch gradient update
        history.append(theta)
        if counter % 25 == 0: preds.append(pred)
        counter += 1
        if abs(costs[-1] - costs[-2]) < precision: break  # converged
        if maxsteps and counter == maxsteps: break
    return history, costs[1:], preds, counter
np.random.rand(2)
xaug = np.c_[np.ones(x.shape[0]), x]
theta_i = [-15, 40] + np.random.rand(2)
history, cost, preds, iters = gradient_descent(xaug, y, theta_i)
theta = history[-1]
theta
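As a sanity check (not in the original notebook), we can compare the fitted theta against the closed-form least-squares solution:

theta_closed, *_ = np.linalg.lstsq(xaug, y, rcond=None)  # normal-equations solution
print(theta, theta_closed)                               # should agree closely if descent converged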
plt.plot(range(len(cost)), cost);
def init():
line.set_data([], [])
return line,
def animate(i):
ys = preds[i]
line.set_data(xaug[:, 1], ys)
return line,
fig = plt.figure(figsize=(10,6))
ax = plt.axes(xlim=(-3, 2.5), ylim=(-170, 170))
ax.plot(xaug[:,1],y, 'o')
line, = ax.plot([], [], lw=2)
def best_fit(xval):
    # reconstructed helper: the cell defining best_fit did not survive extraction
    return theta[0] + theta[1] * xval

plt.plot(xaug[:, 1], best_fit(xaug[:, 1]), 'r-')   # best-fit line in red
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=len(preds), interval=100)  # call reconstructed; frame interval assumed
Remember that the linear regression cost function is convex, and more precisely quadratic. We can see the path that gradient descent
takes in arriving at the optimum:
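The plotting cell for this figure did not survive; here is a minimal sketch along those lines (grid ranges and styling are assumptions) that contours the quadratic cost and overlays the iterates stored in history:

ms = np.linspace(theta[0] - 20, theta[0] + 20, 50)
bs = np.linspace(theta[1] - 50, theta[1] + 50, 50)
M, B = np.meshgrid(ms, bs)
Z = np.array([np.sum((xaug.dot(np.array(t)) - y) ** 2) / (2 * y.size)
              for t in zip(np.ravel(M), np.ravel(B))]).reshape(M.shape)
plt.contourf(M, B, Z, 30, cmap='viridis')   # quadratic cost surface
hist = np.array(history)
plt.plot(hist[:, 0], hist[:, 1], 'w.-')     # path of the gradient descent iterates
plt.xlabel('intercept'); plt.ylabel('slope');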
Stochastic gradient descent

In stochastic gradient descent, instead of summing over all observations, we update the parameters using the gradient from a single observation at a time:

$$\theta := \theta - \eta \nabla_\theta J_i(\theta)$$
This stochastic gradient approach allows us to start making progress on the minimization problem right away. It is computationally
cheaper, but it results in a larger variance of the loss function in comparison with batch gradient descent.
Generally, the stochastic gradient descent method will get close to the optimal θ much faster than the batch method, but will never
fully converge to the local (or global) minimum. Thus the stochastic gradient descent method is useful when we want a quick and
dirty approximation for the solution to our optimization problem. A full recipe for stochastic gradient descent follows:

1. Initialize the parameter vector $\theta$ and pick a learning rate $\eta$.
2. Randomly reshuffle the examples in the data set.
3. For $i = 1, \ldots, m$, perform the update $\theta := \theta - \eta \nabla_\theta J_i(\theta)$.
4. Repeat steps 2-3 until an approximate minimum is obtained.

The reshuffling of the data is done to avoid a bias in the optimization algorithm by providing the data examples in a particular order.
In code, the algorithm should look something like this:
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
For a given epoch, we first reshuffle the data, and then, for each example in turn, we evaluate the gradient of the loss function and update the parameters with the chosen learning rate.
#print("e/cc",error, currentcost)
if counter % 25 == 0: preds.append(pred)
counter+=1
costsum += currentcost
oldcost = costs[counter-2]
costs.append(costsum/counter)
#print(counter, costs, oldcost)
if maxsteps:
#print("in maxsteps")
if counter == maxsteps:
break
return history, costs, preds, grads, counter, epoch, xs, ys, currentcosts
history2, cost2, preds2, grads2, iters2, epoch2, x2, y2, cc2 = sgd(xaug, y, theta_i, maxsteps=5000, step=0.01)  # the original line was truncated after "st"; the step value is assumed
Animating SGD
Here is some code to make an animation of SGD. It shows how the risk surface being minimized changes from example to example, and how the iterates approach the desired minimum.
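The helper error2 used below did not survive extraction; here is a minimal reconstruction consistent with the least-squares cost defined earlier:

def error2(X, y, theta):
    # least-squares cost J(theta) evaluated on design matrix X
    pred = X.dot(np.array(theta))
    return np.sum((pred - y) ** 2) / (2 * y.size)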
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection (needed on older matplotlib)

def make_3d_plot2(num, it, xfinal, yfinal, zfinal, hist, cost, xaug, y):
    ms = np.linspace(xfinal - 20, xfinal + 20, 20)
    bs = np.linspace(yfinal - 50, yfinal + 50, 40)
    M, B = np.meshgrid(ms, bs)
    zs = np.array([error2(xaug, y, theta)
                   for theta in zip(np.ravel(M), np.ravel(B))])
    Z = zs.reshape(M.shape)
    fig = plt.figure(figsize=(10, 6))
    ax = fig.add_subplot(111, projection='3d')
    # remainder reconstructed: draw the surface, mark the iterate and the minimum, save a frame
    ax.plot_surface(M, B, Z, rstride=1, cstride=1, alpha=0.2)
    ax.plot([h[0] for h in hist], [h[1] for h in hist], cost, 'ro')   # current SGD iterate
    ax.plot([xfinal], [yfinal], [zfinal], 'k*')                       # overall minimum
    ax.set_xlabel('Intercept'); ax.set_ylabel('Slope')
    ax.set_title('Iteration {}'.format(it))
    fig.savefig('3danim{0:03d}.png'.format(num))   # frames are stitched into a gif below
    plt.close(fig)
ST = list(range(0, iters2, 100))   # snapshot indices, every 100 SGD steps (assumed; the original definition was lost)
for i in range(len(ST)):
    make_3d_plot2(i, ST[i], theta[0], theta[1], cost2[-1], [history2[ST[i]]],
                  [cc2[ST[i]]], np.array([x2[ST[i]]]), np.array([y2[ST[i]]]))  # trailing arguments completed from the truncated source
Using ImageMagick we can produce a gif animation: convert -delay 20 -loop 1 3danim*.png animsgd.gif

(I set this animation to repeat just once (-loop 1); reload this cell to see it again. On the web page, right-clicking the image may offer an option to loop it again.)
Mini-batch gradient descent

Rather than using a single example, we can compute each update on a small batch of $n$ examples:

$$\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i:i+n)}, y^{(i:i+n)})$$
This is what mini-batch gradient descent is about. Using mini-batches has the advantage that the variance in the loss function is
reduced, while the computational burden is still reasonable, since we do not use the full dataset. The size of the mini-batches becomes
another hyper-parameter of the problem. In standard implementations it ranges from 50 to 256. In code, mini-batch gradient descent
looks like this:
for i in range(mb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
The difference with SGD is that for each update we use a batch of 50 examples to estimate the gradient.
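The get_batches helper above is pseudocode; a minimal concrete version (assuming data is an array that has already been shuffled in place) could be:

def get_batches(data, batch_size=50):
    # yield successive mini-batches of rows from the shuffled data
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]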
Variations on a theme
Momentum
Often, the cost function has ravines near local optima, i.e., areas where the shape of the function is significantly steeper in certain dimensions than in others. This might result in slow convergence to the optimum, since standard gradient descent will keep oscillating across these ravines. In the figures below, the left panel shows convergence without momentum, and the right panel shows the effect of adding momentum:
<img src="http://sebastianruder.com/content/images/2015/12/without_momentum.gif" width=300 height=300> <img src="http://sebastianruder.com/content/images/2015/12/with_momentum.gif" width=300 height=300>
One way to overcome this problem is by using the concept of momentum, which is borrowed from physics. At each iteration, we remember the update $v = \Delta\theta$ and use this velocity vector (which has the same dimension as $\theta$) in the next update, which is constructed as a combination of the cost gradient and the previous update:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$$
$$\theta := \theta - v_t$$

where the momentum parameter $\gamma$ is usually set to around 0.9.
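As a sketch (not from the notebook), here is the momentum update applied to our linear example; the coefficient gamma = 0.9 and the learning rate are conventional choices, not values fixed by the text:

gamma, eta = 0.9, 0.001                   # momentum coefficient and learning rate (assumed)
v = np.zeros(2)                           # velocity starts at rest
theta_m = np.array(theta_i, dtype=float)
for _ in range(1000):
    grad = xaug.T.dot(xaug.dot(theta_m) - y)  # full-batch gradient of J
    v = gamma * v + eta * grad                # accumulate a decaying velocity
    theta_m = theta_m - v                     # momentum step
print(theta_m)                                # should approach the least-squares solution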