Gradient Descent and SGD
Keywords: optimization | gradient descent | sgd | minibatch sgd | linear regression
Contents
Example: Linear regression
Animating SGD
Variations on a theme
Momentum
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from scipy import stats
A great discussion (and where the momentum images were stolen from) is at http://sebastianruder.com/optimizing-gradient-descent/
Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. Gradient descent is a way to minimize an objective function $J(\theta)$ parameterized by a model's parameters $\theta \in \mathbb{R}^d$ by updating the parameters in the opposite direction of the gradient of the objective function $\nabla_\theta J(\theta)$ w.r.t. the parameters. The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
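To make the update rule concrete, here is a tiny sketch (not from the notebook) that minimizes $J(\theta) = \theta^2$, whose gradient is $2\theta$; the starting point and learning rate are arbitrary choices:

theta = 5.0                     # arbitrary starting point
eta = 0.1                       # learning rate
for _ in range(50):
    grad = 2 * theta            # gradient of J(theta) = theta**2
    theta = theta - eta * grad  # step opposite the gradient
print(theta)                    # approaches the minimum at theta = 0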
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.
Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform
an update.
Example: Linear regression

We will use linear least squares regression as a running example, with the model

$$f_\theta(x) = \theta^T x$$
The cost function for our linear least squares regression will then be

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m} \left(f_\theta(x^{(i)}) - y^{(i)}\right)^2$$
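The cell that generated the example data was lost in extraction; here is a minimal stand-in that feeds the x = x.flatten() step below. The seed, sample size, true intercept and slope, and noise scale are all assumptions, chosen only to match the plot limits used later:

np.random.seed(42)                                   # assumed seed, for reproducibility
m = 100                                              # assumed sample size
x = np.random.uniform(-3, 2.5, size=(m, 1))          # inputs spanning the plot's x-limits
y = -15 + 40 * x.ravel() + stats.norm(0, 10).rvs(m)  # hypothetical true line plus Gaussian noise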
x = x.flatten()
plt.plot(x, y, 'o');   # scatter of the raw data (the plot call was lost; only its output survived)
Batch gradient descent
Assume that we have a vector of parameters $\theta$ and a cost function $J(\theta)$, which is simply the function we want to minimize (our objective function). Typically, we will find that the objective function has the form:

$$J(\theta) = \sum_{i=1}^{m} J_i(\theta)$$

where $J_i$ is associated with the $i$-th observation in our data set. The batch gradient descent algorithm starts with some initial feasible $\theta$ (which we can either fix or assign randomly) and then repeatedly performs the update:

$$\theta := \theta - \eta \nabla_\theta J(\theta) = \theta - \eta \sum_{i=1}^{m} \nabla_\theta J_i(\theta)$$

where $\eta$ is a constant controlling step-size and is called the learning rate. Note that in order to make a single update, we need to calculate the gradient using the entire dataset. This can be very inefficient for large datasets.
for i in range(n_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
For a given number of epochs n_epochs, we first evaluate the gradient vector of the loss function using ALL examples in the data set, and then we update the parameters with a given learning rate. This is where Theano and automatic differentiation come in handy, and you will learn about them in lab.
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-
convex surfaces.
In the linear example it's easy to see that our update step then takes the form:

$$\theta := \theta - \eta \sum_{i=1}^{m} \left(f_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$
def gradient_descent(x, y, theta, step=0.001, maxsteps=0, precision=0.001):
    # reconstructed sketch: the top of this function was lost; defaults are assumptions
    history, costs, preds, counter = [], [np.inf], [], 0
    while True:
        pred = x.dot(theta)                              # predictions on the full data set
        error = pred - y
        costs.append(np.sum(error ** 2) / (2 * y.size))  # least-squares cost
        theta = theta - step * x.T.dot(error)            # batch gradient update
        history.append(theta)
        if counter % 25 == 0: preds.append(pred)
        counter += 1
        if abs(costs[-1] - costs[-2]) < precision: break  # converged
        if maxsteps and counter == maxsteps: break
    return history, costs[1:], preds, counter
np.random.rand(2)
xaug = np.c_[np.ones(x.shape[0]), x]
theta_i = [-15, 40] + np.random.rand(2)
history, cost, preds, iters = gradient_descent(xaug, y, theta_i)
theta = history[-1]
theta
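As a sanity check (not in the original notebook), we can compare the fitted theta against the closed-form least-squares solution:

theta_closed, *_ = np.linalg.lstsq(xaug, y, rcond=None)  # normal-equations solution
print(theta, theta_closed)                               # should agree closely if descent converged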
plt.plot(range(len(cost)), cost);
def init():
line.set_data([], [])
return line,
def animate(i):
ys = preds[i]
line.set_data(xaug[:, 1], ys)
return line,
fig = plt.figure(figsize=(10,6))
ax = plt.axes(xlim=(-3, 2.5), ylim=(-170, 170))
ax.plot(xaug[:,1],y, 'o')
line, = ax.plot([], [], lw=2)
def best_fit(xval):
    # reconstructed helper: the cell defining best_fit did not survive extraction
    return theta[0] + theta[1] * xval

plt.plot(xaug[:, 1], best_fit(xaug[:, 1]), 'r-')   # best-fit line in red
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=len(preds), interval=100)  # call reconstructed; frame interval assumed
Remember that the linear regression cost function is convex, and more precisely quadratic. We can see the path that gradient descent
takes in arriving at the optimum:
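The plotting cell for this figure did not survive; here is a minimal sketch along those lines (grid ranges and styling are assumptions) that contours the quadratic cost and overlays the iterates stored in history:

ms = np.linspace(theta[0] - 20, theta[0] + 20, 50)
bs = np.linspace(theta[1] - 50, theta[1] + 50, 50)
M, B = np.meshgrid(ms, bs)
Z = np.array([np.sum((xaug.dot(np.array(t)) - y) ** 2) / (2 * y.size)
              for t in zip(np.ravel(M), np.ravel(B))]).reshape(M.shape)
plt.contourf(M, B, Z, 30, cmap='viridis')   # quadratic cost surface
hist = np.array(history)
plt.plot(hist[:, 0], hist[:, 1], 'w.-')     # path of the gradient descent iterates
plt.xlabel('intercept'); plt.ylabel('slope');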
Stochastic gradient descent

In stochastic gradient descent, instead of summing over all observations, we update the parameters using the gradient from a single observation at a time:

$$\theta := \theta - \eta \nabla_\theta J_i(\theta)$$
This stochastic gradient approach allows us to start making progress on the minimization problem right away. It is computationally
cheaper, but it results in a larger variance of the loss function in comparison with batch gradient descent.
Generally, the stochastic gradient descent method will get close to the optimal θ much faster than the batch method, but will never
fully converge to the local (or global) minimum. Thus the stochastic gradient descent method is useful when we want a quick and
dirty approximation for the solution to our optimization problem. A full recipe for stochastic gradient descent follows:

1. Initialize the parameter vector $\theta$ and pick a learning rate $\eta$.
2. Randomly reshuffle the examples in the data set.
3. For $i = 1, \ldots, m$, perform the update $\theta := \theta - \eta \nabla_\theta J_i(\theta)$.
4. Repeat steps 2-3 until an approximate minimum is obtained.

The reshuffling of the data is done to avoid a bias in the optimization algorithm by providing the data examples in a particular order.
In code, the algorithm should look something like this:
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
For a given epoch, we first reshuffle the data, and then, for each example in turn, we evaluate the gradient of the loss function and update the parameters with the chosen learning rate.
#print("e/cc",error, currentcost)
if counter % 25 == 0: preds.append(pred)
counter+=1
costsum += currentcost
oldcost = costs[counter-2]
costs.append(costsum/counter)
#print(counter, costs, oldcost)
if maxsteps:
#print("in maxsteps")
if counter == maxsteps:
break
return history, costs, preds, grads, counter, epoch, xs, ys, currentcosts
history2, cost2, preds2, grads2, iters2, epoch2, x2, y2, cc2 = sgd(xaug, y, theta_i, maxsteps=5000, step=0.01)  # the original line was truncated after "st"; the step value is assumed
Animating SGD
Here is some code to make an animation of SGD. It shows how the risk surface being minimized changes from example to example, and how the iterates approach the desired minimum.
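The helper error2 used below did not survive extraction; here is a minimal reconstruction consistent with the least-squares cost defined earlier:

def error2(X, y, theta):
    # least-squares cost J(theta) evaluated on design matrix X
    pred = X.dot(np.array(theta))
    return np.sum((pred - y) ** 2) / (2 * y.size)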
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection (needed on older matplotlib)

def make_3d_plot2(num, it, xfinal, yfinal, zfinal, hist, cost, xaug, y):
    ms = np.linspace(xfinal - 20, xfinal + 20, 20)
    bs = np.linspace(yfinal - 50, yfinal + 50, 40)
    M, B = np.meshgrid(ms, bs)
    zs = np.array([error2(xaug, y, theta)
                   for theta in zip(np.ravel(M), np.ravel(B))])
    Z = zs.reshape(M.shape)
    fig = plt.figure(figsize=(10, 6))
    ax = fig.add_subplot(111, projection='3d')
    # remainder reconstructed: draw the surface, mark the iterate and the minimum, save a frame
    ax.plot_surface(M, B, Z, rstride=1, cstride=1, alpha=0.2)
    ax.plot([h[0] for h in hist], [h[1] for h in hist], cost, 'ro')   # current SGD iterate
    ax.plot([xfinal], [yfinal], [zfinal], 'k*')                       # overall minimum
    ax.set_xlabel('Intercept'); ax.set_ylabel('Slope')
    ax.set_title('Iteration {}'.format(it))
    fig.savefig('3danim{0:03d}.png'.format(num))   # frames are stitched into a gif below
    plt.close(fig)
ST = list(range(0, iters2, 100))   # snapshot indices, every 100 SGD steps (assumed; the original definition was lost)
for i in range(len(ST)):
    make_3d_plot2(i, ST[i], theta[0], theta[1], cost2[-1], [history2[ST[i]]],
                  [cc2[ST[i]]], np.array([x2[ST[i]]]), np.array([y2[ST[i]]]))  # trailing arguments completed from the truncated source
Using ImageMagick we can produce a gif animation: convert -delay 20 -loop 1 3danim*.png animsgd.gif

(I set this animation to repeat just once (-loop 1); reload this cell to see it again. On the web page, right-clicking the image may offer an option to loop it again.)
Mini-batch gradient descent

Rather than using a single example, we can compute each update on a small batch of $n$ examples:

$$\theta = \theta - \eta \nabla_\theta J(\theta; x^{(i:i+n)}, y^{(i:i+n)})$$
This is what mini-batch gradient descent is about. Using mini-batches has the advantage that the variance in the loss function is
reduced, while the computational burden is still reasonable, since we do not use the full dataset. The size of the mini-batches becomes
another hyper-parameter of the problem. In standard implementations it ranges from 50 to 256. In code, mini-batch gradient descent
looks like this:
for i in range(mb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
The difference with SGD is that for each update we use a batch of 50 examples to estimate the gradient.
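The get_batches helper above is pseudocode; a minimal concrete version (assuming data is an array that has already been shuffled in place) could be:

def get_batches(data, batch_size=50):
    # yield successive mini-batches of rows from the shuffled data
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]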
Variations on a theme
Momentum
Often, the cost function has ravines near local optima, i.e., areas where the shape of the function is significantly steeper in certain dimensions than in others. This might result in slow convergence to the optimum, since standard gradient descent will keep oscillating across these ravines. In the figures below, the left panel shows convergence without momentum, and the right panel shows the effect of adding momentum:
<img src="http://sebastianruder.com/content/images/2015/12/without_momentum.gif" width=300 height=300> <img src="http://sebastianruder.com/content/images/2015/12/with_momentum.gif" width=300 height=300>
One way to overcome this problem is by using the concept of momentum, which is borrowed from physics. At each iteration, we remember the update $v = \Delta\theta$ and use this velocity vector (which has the same dimension as $\theta$) in the next update, which is constructed as a combination of the cost gradient and the previous update:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$$
$$\theta := \theta - v_t$$

where the momentum parameter $\gamma$ is usually set to around 0.9.
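As a sketch (not from the notebook), here is the momentum update applied to our linear example; the coefficient gamma = 0.9 and the learning rate are conventional choices, not values fixed by the text:

gamma, eta = 0.9, 0.001                   # momentum coefficient and learning rate (assumed)
v = np.zeros(2)                           # velocity starts at rest
theta_m = np.array(theta_i, dtype=float)
for _ in range(1000):
    grad = xaug.T.dot(xaug.dot(theta_m) - y)  # full-batch gradient of J
    v = gamma * v + eta * grad                # accumulate a decaying velocity
    theta_m = theta_m - v                     # momentum step
print(theta_m)                                # should approach the least-squares solution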