L23 Stochastic Gradient and Mini Batch
The instructor of this course owns the copyright of all the course materials. This lecture
material was distributed only to the students attending the course MTH511a: “Statistical
Simulation and Data Analysis” of IIT Kanpur, and should not be distributed in print or
through electronic media without the consent of the instructor. Students can make their own
copies of the course materials for their use.
Recall, in order to maximize the objective function the gradient ascent algorithm does
the following update:
$$ \theta^{(k+1)} = \theta^{(k)} + t\, \nabla f\left(\theta^{(k)}\right), $$
where $\nabla f(\theta^{(k)})$ is the gradient vector. Now, since in many statistics problems the
objective function is the average log-likelihood (for some density $\tilde{f}$),
$$ f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \tilde{f}(\theta \mid x_i) \;\Rightarrow\; \nabla f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla\left[ \log \tilde{f}(\theta \mid x_i) \right]. $$
That is, in order to implement a gradient ascent step, the gradient of the log-likelihood
is calculated for the whole data. However, consider the following two situations:
• the data size n and/or dimension of ✓ are prohibitively large so that calculating
the full gradient multiple times is infeasible
• the data is not available at once! In many online data situations, the full data set
is not available, but comes in sequentially. Then, the full data gradient vector is
not available.
In such situations, when the full gradient vector is unavailable, our goal is to estimate
the gradient. Suppose $i_k$ is an index chosen uniformly at random from $\{1, \dots, n\}$. Then
$$ E\left[ \nabla \log \tilde{f}(\theta \mid x_{i_k}) \right] = \frac{1}{n} \sum_{i=1}^{n} \nabla \log \tilde{f}(\theta \mid x_i) = \nabla f(\theta). $$
Thus, $\nabla \log \tilde{f}(\theta \mid x_{i_k})$ is an unbiased estimator of the complete gradient, but uses
only one data point. Replacing the complete gradient with this estimate yields the
stochastic gradient ascent update:
$$ \theta^{(k+1)} = \theta^{(k)} + t\, \nabla\left[ \log \tilde{f}\left(\theta^{(k)} \mid x_{i_k}\right) \right], $$
where ik is a randomly chosen index. This randomness in choosing the index makes
this a stochastic algorithm.
• advantage: it is much cheaper to implement, since only one data point is required
for each gradient evaluation
• disadvantage: it may require larger $k$ for convergence to the optimal solution
• disadvantage: as $k$ increases, $\theta^{(k+1)} \not\to \theta^*$. Rather, after some initial steps, $\theta^{(k+1)}$
oscillates around $\theta^*$.
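The unbiasedness claim above can be checked empirically. The following standalone sketch (not part of the lecture code; it uses a toy $N(\theta, 1)$ model, where $\nabla \log \tilde{f}(\theta \mid x_i) = x_i - \theta$) averages many single-point gradients and compares the result with the full-data gradient:

```r
# Toy check of unbiasedness for the single-point gradient estimator.
# Model: x_i ~ N(theta, 1), so grad of log f~(theta | x_i) is (x_i - theta).
set.seed(1)
n <- 100
x <- rnorm(n, mean = 2)
theta <- 0

full.grad <- mean(x) - theta         # gradient of the average log-likelihood

B <- 1e5                             # many random draws of the index i_k
ik <- sample(1:n, B, replace = TRUE)
sg <- x[ik] - theta                  # single-point gradient estimates
mean(sg)                             # close to full.grad on average
```

Each `sg` value is a noisy estimate, but their average matches the complete gradient, which is exactly what the expectation identity says.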
However, since each step evaluates the gradient at only one data point, variability in subsequent
updates of $\theta^{(k)}$ increases. To stabilize this behavior, mini-batch stochastic
gradient is often used.
1.1.1 Mini-batch stochastic gradient ascent
There are not many clear rules for terminating the stochastic gradient algorithm.
Typically, the number of iterations is set to $K = n$, so that one full pass over the data is
implemented.
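For concreteness, the mini-batch update can be written as follows (this formalization is consistent with the code below, where the batch gradient is divided by the batch size): with a randomly chosen batch $B_k \subseteq \{1, \dots, n\}$ of size $b$,

```latex
\theta^{(k+1)} = \theta^{(k)} + t \cdot \frac{1}{b} \sum_{i \in B_k} \nabla \left[ \log \tilde{f}\left(\theta^{(k)} \mid x_i\right) \right].
```

Averaging over $b$ points keeps the estimator unbiased for the complete gradient while reducing its variance, which is why mini-batches stabilize the updates.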
Recall the logistic regression setup where for a response $Y$ and a covariate matrix $X$,
$$ Y_i \sim \text{Bern}\left( \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} \right), \quad i = 1, \dots, n. $$
$$ \Rightarrow\; \log \tilde{f}(\beta) = -\sum_{i=1}^{n} \log\left(1 + \exp(x_i^T \beta)\right) + \sum_{i=1}^{n} y_i x_i^T \beta. $$
Taking the derivative:
$$ \nabla \log \tilde{f}(\beta) = \sum_{i=1}^{n} x_i \left( y_i - \frac{e^{x_i^T \beta}}{1 + e^{x_i^T \beta}} \right). $$
As noted earlier, the target objective is concave; thus a global optimum exists and the
gradient ascent algorithm will converge to the MLE. We will implement the stochastic
gradient ascent algorithm here.
################################################
## MLE for logistic regression
## Using stochastic gradient ascent
################################################
#################################################
## The following is a general function that
## implements the regular gradient ascent
## the stochastic gradient ascent and
## mini-batch stochastic gradient ascent
#################################################
# gradient of the logistic log-likelihood (defined for use inside SGA)
f.gradient <- function(y, X, beta)
{
  t(X) %*% (y - exp(X %*% beta)/(1 + exp(X %*% beta)))
}

SGA <- function(y, X, batch.size = dim(X)[1], t = .1, max.iter = dim(X)[1],
                adapt = FALSE)
{
  p <- dim(X)[2]
  n <- dim(X)[1]
  beta_k <- matrix(0, nrow = p, ncol = 1)   # starting value
  # randomly partition the data indices into batches of size batch.size
  batch.index <- split(sample(1:n), ceiling(seq_len(n)/batch.size))
  count <- 0
  for(iter in 1:max.iter)
  {
    count <- count + 1
    if(count > length(batch.index)) count <- 1   # recycle batches across passes
    # step size: fixed t, or the decreasing sequence t/log(k) when adapt = TRUE
    tk <- if(adapt) t/log(max(iter, 2)) else t
    # batch of data
    y.batch <- y[ batch.index[[count]] ]
    X.batch <- matrix(X[batch.index[[count]], ], nrow = length(batch.index[[count]]))
    # SGA step: ascend along the averaged batch gradient
    beta_k <- beta_k + tk * f.gradient(y = y.batch, X = X.batch,
                                       beta = beta_k)/length(batch.index[[count]])
  }
  return(list(beta = beta_k, iter = max.iter))
}
Next, I will generate data from the logistic regression model in order to demonstrate
the performance of the stochastic gradient ascent algorithm.
# Generating data for demonstration
set.seed(10)
p <- 5
n <- 1e4
X <- matrix(rnorm(n*(p-1)), nrow = n, ncol = p-1)
X <- cbind(1, X)
beta <- matrix(rnorm(p, 0, sd = 1), ncol = 1)
prob <- exp(X %*% beta)/(1 + exp(X %*% beta))
y <- rbinom(n, size = 1, prob = prob)
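Since this is a standard logistic regression, the MLE for the simulated data can also be obtained with base R's built-in glm, which is a convenient benchmark for the gradient ascent output. (This check is an addition here, not part of the lecture code; the data are re-simulated inside the snippet so it stands alone.)

```r
# Benchmark: compute the MLE directly with base R's glm().
# Re-simulate the data exactly as in the block above.
set.seed(10)
p <- 5
n <- 1e4
X <- cbind(1, matrix(rnorm(n*(p-1)), nrow = n, ncol = p-1))
beta <- matrix(rnorm(p), ncol = 1)
prob <- exp(X %*% beta)/(1 + exp(X %*% beta))
y <- rbinom(n, size = 1, prob = prob)

# "- 1" in the formula because X already contains an intercept column
fit <- glm(y ~ X - 1, family = binomial)
coef(fit)   # gradient ascent estimates should approach these values
```

With $n = 10^4$ observations, these coefficients are close to the true `beta`, so they serve as a reliable target for judging convergence of the stochastic algorithms.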
We implement the stochastic gradient ascent algorithm, keeping track of the running
average of the estimates of $\theta^*$, denoted $\hat{\theta}_k$, and of the norm of the complete gradient at that
value, $\|\nabla f(\hat{\theta}_k)\|$. A running plot of this quantity should tend to 0 as $k$ increases. We will
implement the original gradient ascent, stochastic gradient ascent, and mini-batch
gradient ascent algorithms.
I first run this for tuned values of t.
# Tuned value of t
ga <- SGA(y, X, batch.size = 1e4, t = .0015, max.iter = 1e3)
b1 <- SGA(y, X, batch.size = 1, t = .1, max.iter = ga$iter)
b10 <- SGA(y, X, batch.size = 10, t = .1, max.iter = ga$iter)
b100 <- SGA(y, X, batch.size = 100, t = .1, max.iter = ga$iter)
[Figure: norm of the complete gradient against iteration (Index) for GA, SGA, MB-SGA-10, and MB-SGA-100 with tuned step sizes.]
We see that the original stochastic gradient ascent algorithm is slow to converge, although
it is the cheapest. The mini-batches are far more stable.
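The stability gain of mini-batches can be quantified: averaging $b$ single-point gradients reduces the variance of the gradient estimate by roughly a factor of $b$. A small standalone check (toy $N(\theta, 1)$ setting as before; all names here are illustrative):

```r
# Compare variability of single-point vs mini-batch gradient estimates
# for a toy N(theta, 1) log-likelihood, where grad_i = x_i - theta.
set.seed(2)
n <- 1e3
x <- rnorm(n, mean = 1)
theta <- 0
B <- 1e4

one <- replicate(B, x[sample(n, 1)] - theta)          # batch size 1
ten <- replicate(B, mean(x[sample(n, 10)] - theta))   # batch size 10

var(one)/var(ten)   # roughly 10: variance shrinks by the batch size
```

This is the variance reduction visible in the plot: the batch-10 and batch-100 curves fluctuate far less than the single-point SGA curve.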
Next, we repeat the same algorithms with different choices of t. These choices of
t are too large, so the algorithms no longer perform well and essentially oscillate
locally.
# all bad values of t
ga <- SGA(y, X, batch.size = n, t = .005, max.iter = 1e3)
b1 <- SGA(y, X, batch.size = 1, t = 1, max.iter = ga$iter)
b10 <- SGA(y, X, batch.size = 10, t = 1, max.iter = ga$iter)
b100 <- SGA(y, X, batch.size = 100, t = 1, max.iter = ga$iter)
[Figure: norm of the complete gradient against iteration (Index) for GA, SGA, MB-SGA-10, and MB-SGA-100 with overly large step sizes.]
The original gradient ascent algorithm oscillates drastically, and the stochastic versions
also settle away from 0. Thus, the value of t is critical to implementing
the (stochastic) gradient ascent algorithms in a stable way.
Note that the oscillations occur due to a large value of t. But which values of t will
be too large and which too small is difficult to assess at the outset. It is thus useful
to choose a decreasing sequence $t_k$ that reduces the step size and avoids long durations
of getting stuck in an oscillation. Here we will use $t_k = t/\log(k)$.
ga <- SGA(y, X, batch.size = n, t = .05, max.iter = 1e3, adapt = TRUE)
b1 <- SGA(y, X, batch.size = 1, t = 1, max.iter = ga$iter, adapt = TRUE)
b10 <- SGA(y, X, batch.size = 10, t = 1, max.iter = ga$iter, adapt = TRUE)
b100 <- SGA(y, X, batch.size = 100, t = 1, max.iter = ga$iter, adapt = TRUE)
[Figure: norm of the complete gradient against iteration (Index) for GA, SGA, MB-SGA-10, and MB-SGA-100 with decreasing step sizes.]
Note here that although the algorithms begin to oscillate, the decreasing step sizes allow
them to escape local oscillations.