0% found this document useful (0 votes)

59 views9 pages

L23 Stochastic Gradient and Mini Batch

This lecture discusses stochastic optimization methods for maximizing an objective function. It covers stochastic gradient ascent, where the gradient is estimated using a single random data point, and mini-batch stochastic gradient ascent, which uses a random batch of data. Logistic regression is provided as an example application of these methods. Code is presented to implement stochastic gradient ascent for logistic regression to estimate model parameters.

Uploaded by

Ananya Agarwal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

59 views9 pages

L23 Stochastic Gradient and Mini Batch

Uploaded by

Ananya Agarwal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

MTH 511a - 2020: Lecture 23

Instructor: Dootika Vats

The instructor of this course owns the copyright of all the course materials. This lecture
material was distributed only to the students attending the course MTH511a: “Statistical
Simulation and Data Analysis” of IIT Kanpur, and should not be distributed in print or
through electronic media without the consent of the instructor. Students can make their own
copies of the course materials for their use.

1 Stochastic optimization methods

We go back to optimization this week. The reason we took a break from optimization is
because we will focus on stochastic optimization methods, which will lead the discussion
into other stochastic methods.
We will cover two topics:
1. Stochastic gradient ascent
2. Simulated annealing
Our goal is the same as before: for an objective function f (✓), our goal is to find
✓⇤ = arg max f (✓) .
✓

1.1 Stochastic gradient ascent

Recall, in order to maximize the objective function the gradient ascent algorithm does
the following update:
✓(k+1) = ✓(k) + trf (✓(k) ) ,
where rf (✓(k) ) is the gradient vector. Now, since in many statistics problems, the
objective function is the log-likelihood (for some density f˜),
1X h i
n n
1X ˜ ˜
f (✓) = n log f (✓|xi ) ) rf (✓) = r n log f (✓|xi ) .
n i=1 n i=1
That is, in order to implement a gradient ascent step, the gradient of the log-likelihood
is calculated for the whole data. However, consider the following two situations

1
• the data size n and/or dimension of ✓ are prohibitively large so that calculating
the full gradient multiple times is infeasible
• the data is not available at once! In many online data situations, the full data set
is not available, but comes in sequentially. Then, the full data gradient vector is
not available.
In such situations, when the full gradient vector is unavailable, our goal is to estimate
the gradient. Suppose ik is a randomly chosen index in {1, . . . , n}. Then
h h ii 1 Xh
n i
E r n log f˜(✓|xik ) = nr log f˜(✓|xi ) .
n i=1
h i
Thus, r log nf˜(✓|xik ) is an unbiased estimator of the complete gradient, but uses
only one data point. Replacing the complete gradient with this estimate yields the
stochastic gradient ascent update:
h h ii
˜
✓(k+1) = ✓(k) + t r n log f (✓(k) |xik ) ,

where ik is a randomly chosen index. This randomness in choosing the index makes
this a stochastic algorithm.
• advantage: it is much cheaper to implement since only one-data point is required
for gradient evaluation
• disadvantage it may require larger k for convergence to the optimal solution
• disadvantage as k increases, ✓(k+1) 6! ✓⇤ . Rather, after some initial steps, ✓(k+1)
oscillates around ✓⇤ .

After K iterations, the final estimate of ✓⇤ is

K
ˆ⇤ 1 X
✓ = ✓(k+1) .
K k=1

However, since each step involves evaluating only data gradient, variability in subse-
quent updates of ✓(k+) increases. To stabilize this behavior, often mini-batch stochas-
tic gradient is used.

2
1.1.1 Mini-batch stochastic gradient ascent

Let Ik be a random subset of {1, . . . , n} of size b. Then, the mini-batch stochastic

gradient ascent algorithm implements the following update:
" #
1X h i
✓(k+1) = ✓(k) + t r n log f˜(✓(k) |xi ) .
b i2I
k

The mini-batch stochastic gradient estimate of ✓⇤ after K updates is

K
ˆ⇤ 1 X
✓ = ✓(k) .
K k=1

There are not a lot of clear rules about terminating the algorithm in stochastic gra-
dient. Typically, the number of iterations K = n, so that one full pass at the data is
implemented.

1.1.2 Logistic regression

Recall the logistic regression setup where for a response Y and a covariate matrix X,
T
!
e xi
Yi ⇠ Bern T .
1 + e xi

In order to find the MLE for , we obtain the log-likelihood.

n
Y
L( |Y ) = (pi )yi (1 pi ) 1 yi

i=1
n
X n
X
) n log f˜( ) = n log 1 + exp(xTi ) +n yi xTi
i=1 i=1

Taking derivative:
" #
h i n
X T
exi
r n log f˜( ) = n xi yi T .
i=1
1 + e xi

As noted earlier, the target objective is concave, thus a global optima exists and the
gradient ascent algorithm will converge to the MLE. We will implement the stochastic
gradient ascent algorithm here.

3
################################################
## MLE for logistic regression
## Using stochastic gradient ascent
################################################

f.gradient <- function(y, X, beta)

{
n <- dim(X)[1]
beta <- matrix(beta, ncol = 1)
pi <- exp(X %*% beta) / (1 + exp(X%*%beta))
rtn <- colSums(X* as.numeric(y - pi))
return(n*rtn)
}

#################################################
## The following is a general function that
## implements the regular gradient ascent
## the stochastic gradient ascent and
## mini-batch stochastic gradient ascent
#################################################
SGA <- function(y, X, batch.size = dim(X)[1], t = .1, max.iter = dim(X)[1],
adapt = FALSE)
{
p <- dim(X)[2]
n <- dim(X)[1]

# create the mini-batches

permutation <- sample(1:n, replace = FALSE)
K <- floor(n/batch.size)
batch.index <- split(permutation, rep(1:K, length = n, each = n/K))

# index for choosing the mini-batch

count <- 1

beta_k <- rep(0, p) # start at all 0s

track.gradient <- matrix(0, nrow = max.iter, ncol = p)
track.gradient[1,] <- f.gradient(y = y, X= X, beta = beta_k)

# saving the running mean of the estimates of theta^*

mean_beta <- rep(0,p)

# tk: in case we want t_k

tk <- t

# ideally, we will have a while loop here, but

# I have written this to always complete some max.iter steps

4
for(iter in 1:max.iter)
{
count <- count+1

if(adapt) tk <- t/(sqrt(iter)) # in case t_k

if(count %% K == 0) count <- count%%K +1 # when all batches finish,
restart the batches
if(iter %% (max.iter/10) == 0) print(iter) #feedback

# batch of data
y.batch <- y[batch.index[[count]] ]
X.batch <- matrix(X[batch.index[[count]], ], nrow = batch.size)

# SGA step
beta_k = beta_k + tk* f.gradient(y = y.batch, X = X.batch, beta =
beta_k)/batch.size

# saving overall estimates and running gradients for demonstration

mean_beta <- (beta_k + mean_beta*(iter - 1))/(iter)
if(batch.size == n)
{
est <- beta_k
}else{
est <- mean_beta
}
track.gradient[iter,] <- f.gradient(y = y, X = X, beta = est)/n
}
rtn <- list("iter" = iter, "est" = est, "grad" = track.gradient[1:iter,])
return(rtn)
}

Next, I will generate data from the logistic regression model in order to demonstrate
the performance of the stochastic gradient ascent algorithm.
# Generating data for demonstration
set.seed(10)
p <- 5
n <- 1e4
X <- matrix(rnorm(n*(p-1)), nrow = n, ncol = p-1)
X <- cbind(1, X)
beta <- matrix(rnorm(p, 0, sd = 1), ncol = 1)
p <- exp(X %*% beta)/(1 + exp(X%*%beta))
y <- rbinom(n, size = 1, prob = p)

We implement the stochastic gradient ascent algorithm and keep a track of the running
average of the estimate of ✓⇤ : ✓ˆk⇤ and track the value of the complete gradient at that

5
value krf (✓ˆk⇤ . A running plot of these should tend to 0 as k increases. We will
implement the original gradient ascent, stochastic gradient ascent, and mini-batch
gradient ascent algorithm.
I first run this for tuned values of t.
# Tuned value of t
ga <- SGA(y, X, batch.size = 1e4, t = .0015, max.iter = 1e3)
b1 <- SGA(y, X, batch.size = 1, t = .1, max.iter = ga$iter)
b10 <- SGA(y, X, batch.size = 10, t = .1, max.iter = ga$iter)
b100 <- SGA(y, X, batch.size = 100, t = .1, max.iter = ga$iter)

index <- 1:500

plot(apply(ga$grad[index,], 1, function(t) sum(abs(t))), type = ’l’, ylim =
c(0,max(apply(b1$grad[,], 1, function(t) sum(abs(t))))), ylab =
"Complete gradient")
lines(apply(b1$grad[index,], 1, function(t) sum(abs(t))), col = "red")
lines(apply(b10$grad[index,], 1, function(t) sum(abs(t))), col = "blue")
lines(apply(b100$grad[index,], 1, function(t) sum(abs(t))), col = "orange")
legend("topright", col = c("black", "red", "blue", "orange"), legend =
c("GA", "SGA", "MB-SGA-10", "MB-SGA-100"), lty = 1)

GA
SGA
MB−SGA−10
6000

MB−SGA−100
Complete gradient

4000
2000
0

0 100 200 300 400 500

Index

We see that the original stochastic gradient ascent algorithm is slow to converge al-
though it is the cheapest. The mini-batches are far more stable.
Next, we repeat the same algorithm with di↵erent choices of t. These chosen choices of
t are too large so that all the algorithm no longer perform well and essentially oscillate

6
locally.
# all bad values of t
ga <- SGA(y, X, batch.size = n, t = .005, max.iter = 1e3)
b1 <- SGA(y, X, batch.size = 1, t = 1, max.iter = ga$iter)
b10 <- SGA(y, X, batch.size = 10, t = 1, max.iter = ga$iter)
b100 <- SGA(y, X, batch.size = 100, t = 1, max.iter = ga$iter)

index <- 1:1000

GA
SGA
MB−SGA−10
MB−SGA−100
6000
Complete gradient

4000
2000
0

0 200 400 600 800 1000

Index

The original gradient ascent algorithm oscillates drastically. The stochastic versions
seem to be converging away from 0 as well. Thus, the value of t is critical to imple-
menting the (stochastic) gradient ascent algorithms in a stable way.
Note that the oscillations occurs due to a large value of t. But which value of t which
be large and which will be small is difficult to assess in the beginning. It is thus useful
to choose a decreasing sequence tk that reduces the step size and avoids long durations
of getting stuck in an oscillation. Here we will use tk = t/ log(k).

7
ga <- SGA(y, X, batch.size = n, t = .05, max.iter = 1e3, adapt = TRUE)
b1 <- SGA(y, X, batch.size = 1, t = 1, max.iter = ga$iter, adapt = TRUE)
b10 <- SGA(y, X, batch.size = 10, t = 1, max.iter = ga$iter, adapt = TRUE)
b100 <- SGA(y, X, batch.size = 100, t = 1, max.iter = ga$iter, adapt = TRUE)

index <- 1:1000

GA
SGA
MB−SGA−10
MB−SGA−100
Complete gradient

0 200 400 600 800 1000

Index

Note here that although the algorithm begins oscillate, due to decreasing step-sizes,
the algorithm escapes out of local oscillations.

2 Questions to think about

1. Can we change tk adaptively as the algorithm goes along?
2. Try and implement a stochastic Newton-Raphson algorithm following the same

8
lines of reasoning as discussed here.

L31 Bayesian Logistic Regression PDF
No ratings yet
L31 Bayesian Logistic Regression PDF
8 pages
Gradient Descent Algorithm in Machine Learning
No ratings yet
Gradient Descent Algorithm in Machine Learning
21 pages
Logistic
No ratings yet
Logistic
14 pages
MFD S Assignment 2
No ratings yet
MFD S Assignment 2
18 pages
Gradient Descent and SGD
No ratings yet
Gradient Descent and SGD
8 pages
Gradient Descent
No ratings yet
Gradient Descent
5 pages
Stochastic Gradient Descent: Ryan Tibshirani Convex Optimization 10-725
No ratings yet
Stochastic Gradient Descent: Ryan Tibshirani Convex Optimization 10-725
22 pages
cs188 Fa23 Note23
No ratings yet
cs188 Fa23 Note23
2 pages
Gradient Descent Algorithm in Machine Learning: Dr. P. K. Chaurasia
No ratings yet
Gradient Descent Algorithm in Machine Learning: Dr. P. K. Chaurasia
24 pages
Adaptive Stochastic Conjugate Gradient For Machine Learning
No ratings yet
Adaptive Stochastic Conjugate Gradient For Machine Learning
14 pages
Stochastic Gradient Descent Algorithm
No ratings yet
Stochastic Gradient Descent Algorithm
6 pages
Lecture Slides - Linear Reg
No ratings yet
Lecture Slides - Linear Reg
34 pages
Stochastic Gradient Descent - Term Paper
No ratings yet
Stochastic Gradient Descent - Term Paper
8 pages
11 Gradient Descent
No ratings yet
11 Gradient Descent
58 pages
LN - Optimization For ML
No ratings yet
LN - Optimization For ML
129 pages
Lecture 8: Gradient Descent and Logistic Regression
No ratings yet
Lecture 8: Gradient Descent and Logistic Regression
39 pages
3 Types of Gradient Descent Algorithms For Small & Large Datasets
No ratings yet
3 Types of Gradient Descent Algorithms For Small & Large Datasets
9 pages
3 Gradient Descent
No ratings yet
3 Gradient Descent
8 pages
Stochastic Optimization For Machine Learning
No ratings yet
Stochastic Optimization For Machine Learning
59 pages
Mlfa Autumn 23 Optimization
No ratings yet
Mlfa Autumn 23 Optimization
37 pages
Module3 Ch1
No ratings yet
Module3 Ch1
83 pages
Stochastic Gradient Descent
No ratings yet
Stochastic Gradient Descent
12 pages
14 Efficient Learning
No ratings yet
14 Efficient Learning
7 pages
Gradient Descent and Its Types
No ratings yet
Gradient Descent and Its Types
5 pages
Unit VI Optimization Techniques Question Bank Solved Answer
No ratings yet
Unit VI Optimization Techniques Question Bank Solved Answer
20 pages
Linear Regression
No ratings yet
Linear Regression
63 pages
Logistic Regression
No ratings yet
Logistic Regression
9 pages
17 Large Scale Machine Learning PDF
No ratings yet
17 Large Scale Machine Learning PDF
10 pages
Chap 4 2
No ratings yet
Chap 4 2
214 pages
Convex Module B
No ratings yet
Convex Module B
29 pages
04 Optimization
No ratings yet
04 Optimization
62 pages
12-Mini-Batch Gradient Descent - Exponential Weighted Averages-07-08-2024
No ratings yet
12-Mini-Batch Gradient Descent - Exponential Weighted Averages-07-08-2024
2 pages
Berkeley-Tutorial Optimization For Machine Learningpart2
No ratings yet
Berkeley-Tutorial Optimization For Machine Learningpart2
35 pages
1.1 ID5059 1.2 Tom Kelsey - Jan 2021: February 15, 2021
No ratings yet
1.1 ID5059 1.2 Tom Kelsey - Jan 2021: February 15, 2021
43 pages
03 - Comparing Gradient To Stochastic Gradient - en
No ratings yet
03 - Comparing Gradient To Stochastic Gradient - en
1 page
Stochastic Gradient Descent
No ratings yet
Stochastic Gradient Descent
23 pages
Lecture 05 06
No ratings yet
Lecture 05 06
40 pages
Lecture02a Optimization Annotated PDF
No ratings yet
Lecture02a Optimization Annotated PDF
23 pages
14-RMSProp and Adam Optimization-12!08!2024
No ratings yet
14-RMSProp and Adam Optimization-12!08!2024
2 pages
Gradient Descent
No ratings yet
Gradient Descent
16 pages
Skript Opt Mach
No ratings yet
Skript Opt Mach
49 pages
NIPS 2011 Non Asymptotic Analysis of Stochastic Approximation Algorithms For Machine Learning Paper
No ratings yet
NIPS 2011 Non Asymptotic Analysis of Stochastic Approximation Algorithms For Machine Learning Paper
9 pages
Topic5 Stoch Grad D Oct202023
No ratings yet
Topic5 Stoch Grad D Oct202023
29 pages
WINSEM2024-25 CSE4006 ETH AP2024254000693 2025-01-08 Reference-Material-I
No ratings yet
WINSEM2024-25 CSE4006 ETH AP2024254000693 2025-01-08 Reference-Material-I
40 pages
3.linear Regression
No ratings yet
3.linear Regression
18 pages
Stochastic Gradient Descent
No ratings yet
Stochastic Gradient Descent
23 pages
Lecture 9
No ratings yet
Lecture 9
31 pages
Lecture 7 (With Notes)
No ratings yet
Lecture 7 (With Notes)
39 pages
Lesson 4 Gradient Descent
No ratings yet
Lesson 4 Gradient Descent
13 pages
771 A18 Lec9
No ratings yet
771 A18 Lec9
129 pages
Discussion 4 CS771
No ratings yet
Discussion 4 CS771
25 pages
Optimization-Module Iv
No ratings yet
Optimization-Module Iv
7 pages
Screenshot 2024-10-19 at 10.37.25 AM
No ratings yet
Screenshot 2024-10-19 at 10.37.25 AM
25 pages
Linear Regression Using Batch Gradient Descent
No ratings yet
Linear Regression Using Batch Gradient Descent
7 pages
MTH210
No ratings yet
MTH210
126 pages
CSCE 5063-001: Assignment 2: 1 Implementation of SVM Via Gradient Descent
No ratings yet
CSCE 5063-001: Assignment 2: 1 Implementation of SVM Via Gradient Descent
5 pages
Berkeley-Tutorial Optimization For Machine Learning-Part1
No ratings yet
Berkeley-Tutorial Optimization For Machine Learning-Part1
37 pages
GD Types
No ratings yet
GD Types
98 pages
Lecture 3 ML - Optimization
No ratings yet
Lecture 3 ML - Optimization
32 pages
Worked Examples in Mathematics for Scientists and Engineers
From Everand
Worked Examples in Mathematics for Scientists and Engineers
G. Stephenson
No ratings yet
Student Solutions Manual to Accompany Economic Dynamics in Discrete Time, secondedition
From Everand
Student Solutions Manual to Accompany Economic Dynamics in Discrete Time, secondedition
Yue Jiang
4.5/5 (2)
EE311A 2021 AV Slides L23
No ratings yet
EE311A 2021 AV Slides L23
13 pages
EE311A 2021 AV Slides L21
No ratings yet
EE311A 2021 AV Slides L21
9 pages
Tutorial 5 Solutions
No ratings yet
Tutorial 5 Solutions
13 pages
EE311A 2021 AV Slides L24
No ratings yet
EE311A 2021 AV Slides L24
9 pages
EE311A 2021 AV Slides L20
No ratings yet
EE311A 2021 AV Slides L20
12 pages
Dsa Assignment Theory
No ratings yet
Dsa Assignment Theory
2 pages
EE340A: Electromagnetic Theory: 1 Laplace Transform of Transmission Line Equations
No ratings yet
EE340A: Electromagnetic Theory: 1 Laplace Transform of Transmission Line Equations
7 pages
Principles of Communications - EE320A
No ratings yet
Principles of Communications - EE320A
401 pages
L22 Bootstrap
No ratings yet
L22 Bootstrap
7 pages
L28 Bayseian Linear Regression Linchpin Sampler PDF
No ratings yet
L28 Bayseian Linear Regression Linchpin Sampler PDF
6 pages
L24 Simulated Annelaing
No ratings yet
L24 Simulated Annelaing
9 pages
Em Algo For Multivariate GMM
No ratings yet
Em Algo For Multivariate GMM
9 pages
Midsem cs201 PDF
No ratings yet
Midsem cs201 PDF
1 page
EE320A Solutions For Tutorial 2
No ratings yet
EE320A Solutions For Tutorial 2
14 pages
General Steps of The Finite Element Method
No ratings yet
General Steps of The Finite Element Method
21 pages
SUPPLEMENT System of Linear Equation by Graphing Method
No ratings yet
SUPPLEMENT System of Linear Equation by Graphing Method
4 pages
Putnam 2015-2024
No ratings yet
Putnam 2015-2024
2 pages
Teacher Made Learner Home Task - Template
No ratings yet
Teacher Made Learner Home Task - Template
3 pages
BIT 3209-Lecture 7 Divide and Conquer
No ratings yet
BIT 3209-Lecture 7 Divide and Conquer
27 pages
Unit 5 MCQ
No ratings yet
Unit 5 MCQ
15 pages
CS3401 Algorithms Syllabus
No ratings yet
CS3401 Algorithms Syllabus
3 pages
SR No1
No ratings yet
SR No1
6 pages
Lec 28 Variations in BPNN
100% (1)
Lec 28 Variations in BPNN
20 pages
Mon Monomial Factoring
No ratings yet
Mon Monomial Factoring
32 pages
OR-I Course Guide Book 2014
No ratings yet
OR-I Course Guide Book 2014
3 pages
Cuestionarios IA
No ratings yet
Cuestionarios IA
17 pages
14 Numerical Integration Closed Methods Engr. Labayen
No ratings yet
14 Numerical Integration Closed Methods Engr. Labayen
26 pages
Unit I
No ratings yet
Unit I
20 pages
CSP Nqueen
No ratings yet
CSP Nqueen
17 pages
Lecture 3 Dual Theory Part I 1 PDF
No ratings yet
Lecture 3 Dual Theory Part I 1 PDF
33 pages
Quadratic Equations D1
No ratings yet
Quadratic Equations D1
32 pages
120576-72400-2-PB
No ratings yet
120576-72400-2-PB
15 pages
Ex 9
No ratings yet
Ex 9
2 pages
Aml Lab
No ratings yet
Aml Lab
6 pages
?dsa? Cheatsheets by Princeton - Edu
No ratings yet
?dsa? Cheatsheets by Princeton - Edu
6 pages
Daa Problems
No ratings yet
Daa Problems
17 pages
Operations Research - Exam Helper
No ratings yet
Operations Research - Exam Helper
5 pages
Long Quiz On Rational Algebraic Expressions
0% (1)
Long Quiz On Rational Algebraic Expressions
2 pages
Lesson 15 Solution of A Sequencing Problem
No ratings yet
Lesson 15 Solution of A Sequencing Problem
8 pages
Gauss Elimination, Jordan, Siedel, Jacobi
No ratings yet
Gauss Elimination, Jordan, Siedel, Jacobi
4 pages
Laws Exponent
No ratings yet
Laws Exponent
25 pages
Clustering Activity K Means+ (Age+and+amount)
No ratings yet
Clustering Activity K Means+ (Age+and+amount)
10 pages
Median Order Statistics
No ratings yet
Median Order Statistics
26 pages
Correctness and Complexity Analysis of Quick Sort
No ratings yet
Correctness and Complexity Analysis of Quick Sort
23 pages

L23 Stochastic Gradient and Mini Batch

Uploaded by

L23 Stochastic Gradient and Mini Batch

Uploaded by

MTH 511a - 2020: Lecture 23

Instructor: Dootika Vats

1 Stochastic optimization methods

1.1 Stochastic gradient ascent

After K iterations, the final estimate of ✓⇤ is

Let Ik be a random subset of {1, . . . , n} of size b. Then, the mini-batch stochastic

The mini-batch stochastic gradient estimate of ✓⇤ after K updates is

1.1.2 Logistic regression

In order to find the MLE for , we obtain the log-likelihood.

f.gradient <- function(y, X, beta)

# create the mini-batches

# index for choosing the mini-batch

beta_k <- rep(0, p) # start at all 0s

# saving the running mean of the estimates of theta^*

# tk: in case we want t_k

# ideally, we will have a while loop here, but

if(adapt) tk <- t/(sqrt(iter)) # in case t_k

# saving overall estimates and running gradients for demonstration

index <- 1:500

0 100 200 300 400 500

index <- 1:1000

0 200 400 600 800 1000

index <- 1:1000

0 200 400 600 800 1000

2 Questions to think about

You might also like