Stochastic Gradient Descent Algorithm

Data mining

Multiple Linear Regression

A multiple linear regression model utilizes multiple features and takes the form

Y = β_0 + β_1 X_1 + … + β_p X_p + ε

Here, the stochastic component ε represents the noise and is generally assumed to have the normal distribution N(0, σ²).

A training dataset comprising n examples can be expressed in matrix notation as

Y = Xβ + ε

Here β = (β_0, β_1, …, β_p)' is the vector of model parameters, also called regression coefficients.
The parameters of the model are typically learned (estimated) using the least squares approach,
resulting in

β̂ = (X'X)⁻¹ X'Y

The predicted values of the response variable for the training examples are computed as

Ŷ = X β̂
Let x be the vector containing the values of the features for a new case. The expected response for
this case is predicted as

Ŷ = x' β̂
The precision of β̂ is described by its variance, which is given by

V(β̂) = (X'X)⁻¹ σ²

The variance σ² of the stochastic component can be unbiasedly estimated by

σ̂² = e'e / (n − p − 1)

where the vector of residuals e is given by

e = Y − Ŷ
- The F statistic can be used to test the significance of the model.
- Different models can also be compared using the F test.
Prediction accuracy is measured by

V(Ŷ) = X (X'X)⁻¹ X' σ² = H σ²

The estimated prediction accuracy can be obtained using the estimated value of σ 2.
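As a concrete sketch of the quantities above, here is a small NumPy example (the data and variable names are illustrative, not from the notes):

```python
import numpy as np

# Toy training data: n = 6 examples, p = 2 features.
# The first column of 1s corresponds to the intercept beta_0.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0],
              [1.0, 5.0, 6.0],
              [1.0, 6.0, 5.0]])
Y = np.array([6.1, 5.9, 12.2, 11.8, 18.1, 17.9])

n, p_plus_1 = X.shape
p = p_plus_1 - 1

# Least squares estimate: beta_hat = (X'X)^{-1} X'Y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# Fitted values and residuals
Y_hat = X @ beta_hat
e = Y - Y_hat

# Unbiased estimate of sigma^2: e'e / (n - p - 1)
sigma2_hat = (e @ e) / (n - p - 1)

# Estimated variance of beta_hat: (X'X)^{-1} * sigma^2_hat
V_beta_hat = XtX_inv * sigma2_hat
```

In practice one would prefer `np.linalg.lstsq` (or a QR decomposition) over forming `(X'X)⁻¹` explicitly, for numerical stability; the explicit inverse is shown here only to mirror the formulas.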
Updating regression estimators
To enable the learning algorithm to keep learning with each new experience, we need to
update the learned parameters. When a new example x becomes available, we can update the
learned values of regression coefficients as
β̂_new = (X'_new X_new)⁻¹ X'_new Y_new

where

(X'X)⁻¹_new = (X'X + xx')⁻¹                                                  …(1)

            = (X'X)⁻¹ − ((X'X)⁻¹x)((X'X)⁻¹x)' / (1 + x'(X'X)⁻¹x)      (rank-one update formula)

Thus, we have

β̂_new = β̂ + (X'X)⁻¹x (y − ŷ) / (1 + x'(X'X)⁻¹x)                             …(2)
Thus, one can use the last formula for updating β̂, and then update the stored matrix (X'X)⁻¹. This
strategy requires storage of only (X'X)⁻¹ in addition to β̂.

Instead of updating β̂ with every new example, one can also choose to update β̂ after every
batch of m examples. In that case β̂ can be updated by applying the rank-one update formula
m times, adding one example at a time.
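A minimal NumPy sketch of formulas (1) and (2) (the function name is illustrative; `x` is assumed to include the leading 1 for the intercept):

```python
import numpy as np

def rank_one_update(XtX_inv, beta_hat, x, y):
    """Update beta_hat and (X'X)^{-1} when one new example (x, y) arrives.

    XtX_inv is the stored (X'X)^{-1}, x is the new feature vector
    (including the leading 1), and y is the new observed response.
    """
    Ax = XtX_inv @ x                     # (X'X)^{-1} x
    denom = 1.0 + x @ Ax                 # 1 + x'(X'X)^{-1} x
    y_hat = x @ beta_hat                 # predicted response for the new case
    beta_new = beta_hat + Ax * (y - y_hat) / denom       # formula (2)
    XtX_inv_new = XtX_inv - np.outer(Ax, Ax) / denom     # formula (1)
    return XtX_inv_new, beta_new
```

Applying this m times, one example at a time, reproduces the batch update described above; no matrix inversion is ever needed after the initial fit.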

Deciding importance of features


The F-test, performed on the basis of the ANOVA table for the learned regression model, tests
whether the explanation provided by the learnt model is significant.
It must be understood that a significant F statistic does not imply suitability of the model for
prediction purposes. A value of the R² statistic close to 1, however, indicates usefulness of the
model for prediction purposes.
Even if a regression model is found useful, not all the features used in the model may be important.
It is therefore necessary to identify a subset of features that can provide almost the same
amount of explanation as the full set of features. Alternatively, we may have a large set of
candidate features which can possibly influence the output/response that we wish to model.
This task is called Feature Selection.
An associated problem is when the inputs are not directly usable as features for developing the
learning models. In such situations one must first identify the features for the available inputs.
This task is called Feature Extraction.
Since the tasks of ‘feature selection’ and ‘feature extraction’ are important for every supervised
method, we will return to these tasks after covering other supervised methods.

Gradient descent Algorithm


Descending on a gradient literally means moving downwards on a slope.
Suppose f ( x ) is a function of x that we want to minimize with respect to x.

Let x 0 be the initial guess of the optimum solution.

Then we know that f ' ( x 0 ) is the rate at which value of f changes with x at x = x 0.

In particular, for a small step h,

f(x_0 + h) ≈ f(x_0) + h f'(x_0)

If f ' ( x 0 ) >0 , f increases, whereas

If f ' ( x 0 ) <0 , f decreases.

Thus, f ' ( x 0 ) denotes the rate and the direction of change.

Since our objective is to minimize f, we take a small step in the direction opposite to gradient. In
other words, for positive gradient, we should decrease x, and for negative gradient, we should
increase x.

In practice we choose the step h to be −γ f'(x_0), where γ is called the learning rate. This
choice of step ensures that when f'(x_0) has a larger magnitude (i.e. x_0 is away from the
optimal value), the step is bigger; and when f'(x_0) has a smaller magnitude (i.e. x_0 is close to
the optimal value), the step is smaller, so that we reach the optimal value with good
accuracy.
We keep changing the value of x in the above manner until the minimum is reached (equivalently,
the gradient becomes 0).
Remark: Although the above explanation is for a univariate function, the rationale holds even when
f is a multivariate function. More precisely, f'(x_0) is replaced with the vector of partial
derivatives, called the gradient:

( ∂f(x_0)/∂x_1 , ∂f(x_0)/∂x_2 , … , ∂f(x_0)/∂x_n )
The algorithm
1. Take an initial guess x_0.
2. Compute the gradient at x = x_0.
3. Set x_0 = x_0 − learning rate × gradient.
4. If the magnitude of the gradient is very small, output x_0; else go to step 2.
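The steps above can be sketched in Python for the univariate case (the example function, learning rate, and tolerance are illustrative assumptions):

```python
def gradient_descent_1d(f_prime, x0, learning_rate=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a univariate function, given its derivative f_prime."""
    x = x0
    for _ in range(max_iter):
        g = f_prime(x)                 # step 2: gradient (derivative) at the current point
        if abs(g) < tol:               # step 4: stop when the gradient is tiny
            break
        x = x - learning_rate * g      # step 3: move against the gradient
    return x

# Example: f(x) = (x - 3)^2 has its minimum at x = 3, with f'(x) = 2(x - 3).
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
```

With this learning rate the error shrinks by a constant factor each iteration, illustrating why steps automatically get smaller as x approaches the optimum.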

Gradient descent Algorithm for multiple regression


Consider the multiple regression model:

Ŷ = f(x; β) = β_0 + β_1 x_1 + … + β_p x_p

Suppose Y is the true value and Ŷ is the predicted value of the outcome variable. Then the
squared error loss function is given by

L(Y, Ŷ) = (Y − Ŷ)²

Since Y is a random variable, our objective is to minimize E(L(Y, Ŷ)). When we have training
data, this expectation is estimated by

L̂ = (1/n) Σ_{i=1}^n (Y_i − Ŷ_i)² = (1/n) Σ_{i=1}^n (Y_i − f(x_i; β))²

We then minimize the above estimated quantity with respect to the parameters β.
The gradient vector is given by
∂L̂/∂β = ∂/∂β [ (1/n) Σ_{i=1}^n (Y_i − f(x_i; β))² ] = (2/n) Σ_{i=1}^n (Y_i − f(x_i; β)) ∂/∂β (Y_i − f(x_i; β))

       = (2/n) Σ_{i=1}^n (Y_i − f(x_i; β)) ∂/∂β (Y_i − β_0 − β_1 x_{i,1} − … − β_p x_{i,p})

       = (2/n) Σ_{i=1}^n (Y_i − f(x_i; β)) (−1, −x_{i,1}, …, −x_{i,p})'

The gradient descent algorithm

1. Take an initial guess β_0.
2. Compute the gradient at β = β_0, given by G = (2/n) Σ_{i=1}^n (Y_i − f(x_i; β_0)) (−1, −x_{i,1}, …, −x_{i,p})'.
3. Set β_0 = β_0 − γ G, where γ is the learning rate.
4. If the norm of the gradient is very small, output β_0; else go to step 2.

This iterative algorithm obtains the same LS estimator of β as discussed earlier.
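A NumPy sketch of this batch algorithm on toy data (the data, learning rate, and tolerance are illustrative assumptions), confirming that the iterates converge to the closed-form LS estimator:

```python
import numpy as np

def batch_gradient_descent(X, Y, learning_rate=0.01, tol=1e-8, max_iter=100_000):
    """Fit multiple regression by gradient descent on the mean squared error.

    X includes a leading column of 1s for the intercept.  The gradient is
    G = (2/n) * sum_i (Y_i - f(x_i; beta)) * (-x_i), as derived above.
    """
    n = X.shape[0]
    beta = np.zeros(X.shape[1])             # step 1: initial guess
    for _ in range(max_iter):
        residuals = Y - X @ beta            # Y_i - f(x_i; beta)
        G = -(2.0 / n) * (X.T @ residuals)  # step 2: gradient vector
        if np.linalg.norm(G) < tol:         # step 4: stopping rule
            break
        beta = beta - learning_rate * G     # step 3: update
    return beta

# Toy data: compare the GD iterates with the closed-form LS solution.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)
beta_gd = batch_gradient_descent(X, Y)
beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]   # direct formula, for comparison
```

Note that the learning rate must be small enough relative to the scale of X'X for the iteration to converge; standardizing the features is the usual remedy.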


Q: Why use an iterative algorithm when a direct formula is available in closed form?

Stochastic Gradient Descent (SGD) Algorithm


Even though the iterative formula has the advantage of achieving more accuracy than the direct
formula, in practice it involves a difficulty. Since the volume of training data is
supposed to be large for achieving better training of the model, the computations involved in
the iterative algorithm are prohibitively large.

Instead of estimating E(L(Y, Ŷ)) based on the entire training data, we can take a random
sample from the training data and estimate this expected loss using that sample. Also,
we draw a fresh random sample in every iteration.
Let S denote the training data. The gradient descent algorithm can be expressed in terms
of S as
Algorithm 1

1. Set j = 0. Take an initial guess β^(0) = β_0.
2. Compute the gradient at β = β^(j), given by G^(j) = (2/|S|) Σ_S (Y_i − f(x_i; β^(j))) (−1, −x_{i,1}, …, −x_{i,p})'.
3. Set β^(j+1) = β^(j) − γ G^(j), where γ is the learning rate.
4. If the norm of the gradient is very small, output β^(j+1); else set j = j + 1 and go to step 2.

Here Σ_S denotes the summation taken over the entire training dataset S, and |S| denotes the
cardinality of S. Since we are using the entire training data here, this algorithm is more precisely
referred to as the batch gradient descent algorithm.
The stochastic gradient algorithm given below is instead based on the random samples s j.
Algorithm 2

1. Set j = 0. Take an initial guess β^(0) = β_0.
2. (a) Draw a random sample s_j from the training dataset S.
   (b) Compute the gradient at β = β^(j), given by G^(j) = (2/|s_j|) Σ_{s_j} (Y_i − f(x_i; β^(j))) (−1, −x_{i,1}, …, −x_{i,p})'.
3. Set β^(j+1) = β^(j) − γ G^(j), where γ is the learning rate.
4. If the norm of the gradient is very small, output β^(j+1); else set j = j + 1 and go to step 2.
In practice, the SGD algorithm is implemented by drawing random samples s_j of size 1, which
greatly reduces the computational effort involved in the iterative approach. This specific form of
the algorithm can be presented as

The SGD Algorithm

1. Set j = 0. Take an initial guess β^(0) = β_0.
2. (a) Draw a random observation (x_j, Y_j) from the training dataset S.
   (b) Compute the gradient at β = β^(j), given by G^(j) = 2 (Y_j − f(x_j; β^(j))) (−1, −x_{j,1}, …, −x_{j,p})'.
3. Set β^(j+1) = β^(j) − γ G^(j), where γ is the learning rate.
4. If the norm of the gradient is very small, output β^(j+1); else set j = j + 1 and go to step 2.
The most significant benefit of this algorithm is that, with every new example received from the
field, it can further improve the parameter estimates and thereby improve performance. Thus, the
algorithm makes continuous learning feasible.
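A minimal NumPy sketch of this single-example SGD (the data, epoch count, and learning rate are illustrative assumptions, not prescribed by the notes):

```python
import numpy as np

def sgd_regression(X, Y, learning_rate=0.01, n_epochs=200, seed=0):
    """Stochastic gradient descent with samples of size 1.

    X includes a leading column of 1s.  For each randomly drawn example
    (x_j, Y_j), the gradient is G = 2 * (Y_j - f(x_j; beta)) * (-x_j),
    and beta is updated as beta - learning_rate * G.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    beta = np.zeros(X.shape[1])               # initial guess
    for _ in range(n_epochs):
        for j in rng.permutation(n):          # one random example at a time
            residual = Y[j] - X[j] @ beta     # Y_j - f(x_j; beta)
            G = -2.0 * residual * X[j]        # single-example gradient
            beta = beta - learning_rate * G
    return beta
```

With a fixed learning rate and noisy data, the iterates hover near the LS solution rather than settling on it exactly; decreasing the learning rate over time (or averaging the iterates) is the standard way to obtain convergence.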
