Stochastic Gradient Descent Algorithm
A multiple linear regression model utilizes multiple features and takes the form

Y = β₀ + β₁X₁ + … + βₚXₚ + ε

Here the stochastic component ε represents the noise and is generally assumed to have a
normal distribution N(0, σ²). The vector β = (β₀, β₁, …, βₚ) collects the model parameters,
also called regression coefficients.
The parameters of the model are typically learned (estimated) using the least squares
approach, resulting in

β̂ = (X'X)⁻¹ X'Y
The predicted values of the response variable for the training examples are computed as
Ŷ = Xβ̂
Let x be the vector containing the feature values for a new case; the expected response for
this case is predicted as

Ŷ = x'β̂
The variance of β̂ is given by

V(β̂) = (X'X)⁻¹ σ²
The residual vector is e = Y − Ŷ.
The F statistic can be used to test the significance of the model. Different models can also
be compared using an F test.
Prediction accuracy is measured by

V(Ŷ) = X(X'X)⁻¹X' σ² = Hσ²
The estimated prediction accuracy can be obtained using the estimated value of σ 2.
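The least-squares formulas above can be sketched in NumPy. This is an illustration, not part of the notes: the simulated data, dimensions, and variable names are made-up assumptions.

```python
import numpy as np

# Simulated data for illustration: n cases, p features plus an intercept column.
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column = 1 (intercept)
beta_true = np.array([2.0, 1.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)           # Y = X beta + noise

# beta_hat = (X'X)^{-1} X'Y; solving the normal equations is numerically
# safer than forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

Y_hat = X @ beta_hat              # fitted values for the training examples
e = Y - Y_hat                     # residuals e = Y - Y_hat
sigma2_hat = e @ e / (n - p - 1)  # standard unbiased estimate of sigma^2

# V(beta_hat) = (X'X)^{-1} sigma^2, estimated by plugging in sigma2_hat
V_beta_hat = np.linalg.inv(X.T @ X) * sigma2_hat
```

Plugging the estimate of σ² into V(β̂) and V(Ŷ) gives the estimated precision of the coefficients and predictions.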
Updating regression estimators
To enable the learning algorithm to keep learning with each new experience, we need to
update the learned parameters. When a new example x becomes available, we can update the
learned values of regression coefficients as
β̂_new = (X'X)_new⁻¹ X'_new Y_new

where

(X'X)_new⁻¹ = (X'X + xx')⁻¹    …(1)
            = (X'X)⁻¹ − [(X'X)⁻¹x][(X'X)⁻¹x]' / (1 + x'(X'X)⁻¹x)    (rank-one update formula)
Thus, noting that X'_new Y_new = X'Y + x y, where y is the response value of the new example,
β̂_new can be computed from the previously stored quantities without re-inverting X'X.
Instead of updating β̂ with every new example, one can also choose to update β̂ after every
batch of m examples. In that case β̂ can be updated by applying the ‘rank-one update formula’
m times, adding one example at a time.
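The rank-one update formula above (the Sherman–Morrison identity) can be sketched as follows; the function name and the simulated data are illustrative assumptions, not from the notes.

```python
import numpy as np

def rank_one_update(XtX_inv, x):
    """Return (X'X + x x')^{-1} given (X'X)^{-1}, without re-inverting."""
    u = XtX_inv @ x                                  # u = (X'X)^{-1} x
    return XtX_inv - np.outer(u, u) / (1.0 + x @ u)  # formula (1)

# Illustrative data: existing design matrix X and one new example x_new.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
XtX_inv = np.linalg.inv(X.T @ X)
x_new = rng.normal(size=4)

updated = rank_one_update(XtX_inv, x_new)

# The update agrees with inverting the augmented matrix directly.
direct = np.linalg.inv(X.T @ X + np.outer(x_new, x_new))
print(np.allclose(updated, direct))  # True
```

The update costs O(p²) per new example instead of the O(p³) of a fresh inversion, which is what makes the incremental learning scheme attractive.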
Gradient descent

Let f be a differentiable function of a single variable x. Then f'(x₀) is the rate at which
the value of f changes with x at x = x₀.
In particular, for a small step h,

f(x₀ + h) ≈ f(x₀) + h f'(x₀)
Since our objective is to minimize f, we take a small step in the direction opposite to the
gradient. In other words, for a positive gradient we should decrease x, and for a negative
gradient we should increase x.
In practice we choose the step size h to be −η f'(x₀), where η is called the learning rate.
This choice of step size ensures that when f'(x₀) has a larger magnitude (i.e. x₀ is away
from the optimal value) the step is bigger, and when f'(x₀) has a smaller magnitude (i.e. x₀
is close to the optimal value) the step is smaller, so that we reach the optimal value with
good accuracy.
We keep changing the value of x in the above manner until the minimum is reached
(equivalently, the gradient becomes 0).
Remark: Although the above explanation is for a univariate function, the rationale holds even
when f is a multivariate function. More precisely, f'(x₀) is replaced with the vector of
partial derivatives, called the gradient.
For the linear regression model, the prediction is

Ŷ = f(x; β) = β₀ + β₁x₁ + … + βₚxₚ
Suppose Y is the true value and Ŷ is the predicted value of the outcome variable. Then the
squared error loss function is given by

L(Y, Ŷ) = (Y − Ŷ)²

Since Y is a random variable, our objective is to minimize E(L(Y, Ŷ)). When we have training
data, this expectation is estimated by
L̂ = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² = (1/n) Σᵢ₌₁ⁿ (Yᵢ − f(xᵢ; β))²
We then minimize the above estimated quantity with respect to the parameters β₀, β₁, …, βₚ.
The gradient vector is given by

∂L̂/∂β = ∂/∂β (1/n) Σᵢ₌₁ⁿ (Yᵢ − f(xᵢ; β))²
       = (2/n) Σᵢ₌₁ⁿ (Yᵢ − f(xᵢ; β)) ∂/∂β (Yᵢ − f(xᵢ; β))
       = (2/n) Σᵢ₌₁ⁿ (Yᵢ − f(xᵢ; β)) ∂/∂β (Yᵢ − β₀ − β₁xᵢ,₁ − … − βₚxᵢ,ₚ)
       = (2/n) Σᵢ₌₁ⁿ (Yᵢ − f(xᵢ; β)) (−1, −xᵢ,₁, …, −xᵢ,ₚ)'
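As a sanity check, the gradient formula derived above can be compared against a finite-difference approximation of the loss; the data, helper functions, and dimensions below are illustrative assumptions.

```python
import numpy as np

# Made-up data: n examples, p features, parameters beta = (beta_0, ..., beta_p).
rng = np.random.default_rng(2)
n, p = 30, 2
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

def loss(b):
    """Estimated loss L_hat = (1/n) sum_i (Y_i - f(x_i; b))^2."""
    return np.mean((Y - (b[0] + X @ b[1:])) ** 2)

# Analytic gradient: (2/n) sum_i (Y_i - f(x_i; beta)) * (-1, -x_i1, ..., -x_ip)'
resid = Y - (beta[0] + X @ beta[1:])
grad = (2 / n) * np.column_stack([-np.ones(n), -X]).T @ resid

# Central finite-difference approximation of each partial derivative.
eps = 1e-6
basis = np.eye(p + 1)
num_grad = np.array([(loss(beta + eps * basis[j]) - loss(beta - eps * basis[j])) / (2 * eps)
                     for j in range(p + 1)])
print(np.allclose(grad, num_grad, atol=1e-5))  # True
```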
Instead of estimating E(L(Y , Y^ )) based on the entire training data, we can take a random
sample from the training data and estimate this expected loss using this random sample. Also,
we draw a fresh random sample in every iteration.
Suppose S denotes the training data. The gradient descent algorithm can be expressed in
terms of S as
Algorithm 1
Here Σ_S denotes the summation taken over the entire training dataset S, and |S| denotes the
cardinality of S. Since we are using the entire training data here, this algorithm is more
precisely referred to as the batch gradient descent algorithm.
The stochastic gradient descent algorithm given below is instead based on random samples sⱼ.
Algorithm 2
1. Set j = 0. Take initial guess β⁽⁰⁾ = β₀.
2. (a) Draw a random sample sⱼ from the training dataset S.
   (b) Compute the gradient at β = β⁽ʲ⁾, given by
       G⁽ʲ⁾ = (2/|sⱼ|) Σ_{i∈sⱼ} (Yᵢ − f(xᵢ; β⁽ʲ⁾)) (−1, −xᵢ,₁, …, −xᵢ,ₚ)'
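A minimal sketch of mini-batch stochastic gradient descent along the lines of Algorithm 2. The learning rate, batch size, iteration count, and simulated data are illustrative assumptions; the loop simply repeats steps 2(a)–(b) followed by a step against the gradient.

```python
import numpy as np

# Simulated training data S for the linear model Y = beta_0 + beta_1 x_1 + beta_2 x_2 + noise.
rng = np.random.default_rng(3)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0])
Y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.05, size=n)

beta = np.zeros(p + 1)   # step 1: initial guess beta^(0)
eta, m = 0.05, 32        # learning rate and batch size |s_j| (illustrative choices)

for j in range(2000):
    idx = rng.choice(n, size=m, replace=False)      # step 2(a): draw random sample s_j
    Xb, Yb = X[idx], Y[idx]
    resid = Yb - (beta[0] + Xb @ beta[1:])
    # step 2(b): G^(j) = (2/|s_j|) sum_{i in s_j} (Y_i - f(x_i; beta)) (-1, -x_i1, ..., -x_ip)'
    G = (2 / m) * np.column_stack([-np.ones(m), -Xb]).T @ resid
    beta = beta - eta * G                           # move against the gradient

print(np.round(beta, 2))  # approaches (beta_0, beta_1, beta_2)
```

Because each iteration touches only |sⱼ| examples rather than all of S, each step is much cheaper than a batch gradient step, at the cost of a noisier gradient estimate.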