Stochastic Gradient Descent Algorithm

Data mining

Multiple Linear Regression

A multiple linear regression model utilizes multiple features and takes the form

Y = β_0 + β_1 X_1 + … + β_p X_p + ε

Here, the stochastic component ε represents the noise and is generally assumed to have the normal distribution N(0, σ²).

A training dataset comprising n examples can be expressed in matrix notation as

Y = Xβ + ε

Here β = (β_0, β_1, …, β_p)' is the vector of model parameters, also called regression coefficients.
The parameters of the model are typically learned (estimated) using the least squares approach,
resulting in

β̂ = (X'X)⁻¹ X'Y

The predicted values of the response variable for the training examples are computed as

Ŷ = X β̂
Let x be the vector containing the values of the features for a new case. The expected response for
this case is predicted as

Ŷ = x' β̂
The precision of β̂ is described by its variance, which is given by

V(β̂) = (X'X)⁻¹ σ²

The variance σ² of the stochastic component can be unbiasedly estimated by

σ̂² = e'e / (n − p − 1)

where the vector of residuals e is given by

e = Y − Ŷ
- The F statistic can be used to test the significance of the model.
- Different models can also be compared using the F test.
Prediction accuracy is measured by

V(Ŷ) = X (X'X)⁻¹ X' σ² = H σ²

The estimated prediction accuracy can be obtained using the estimated value of σ 2.
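As a concrete sketch of the quantities above, here is a small NumPy example (the data and variable names are illustrative, not from the notes):

```python
import numpy as np

# Toy training data: n = 6 examples, p = 2 features.
# The first column of 1s corresponds to the intercept beta_0.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0],
              [1.0, 5.0, 6.0],
              [1.0, 6.0, 5.0]])
Y = np.array([6.1, 5.9, 12.2, 11.8, 18.1, 17.9])

n, p_plus_1 = X.shape
p = p_plus_1 - 1

# Least squares estimate: beta_hat = (X'X)^{-1} X'Y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# Fitted values and residuals
Y_hat = X @ beta_hat
e = Y - Y_hat

# Unbiased estimate of sigma^2: e'e / (n - p - 1)
sigma2_hat = (e @ e) / (n - p - 1)

# Estimated variance of beta_hat: (X'X)^{-1} * sigma^2_hat
V_beta_hat = XtX_inv * sigma2_hat
```

In practice one would prefer `np.linalg.lstsq` (or a QR decomposition) over forming `(X'X)⁻¹` explicitly, for numerical stability; the explicit inverse is shown here only to mirror the formulas.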
Updating regression estimators
To enable the learning algorithm to keep learning with each new experience, we need to
update the learned parameters. When a new example x becomes available, we can update the
learned values of regression coefficients as
β̂_new = (X'_new X_new)⁻¹ X'_new Y_new

where

(X'X)⁻¹_new = (X'X + xx')⁻¹                                                  …(1)

            = (X'X)⁻¹ − ((X'X)⁻¹x)((X'X)⁻¹x)' / (1 + x'(X'X)⁻¹x)      (rank-one update formula)

Thus, we have

β̂_new = β̂ + (X'X)⁻¹x (y − ŷ) / (1 + x'(X'X)⁻¹x)                             …(2)
Thus, one can use the last formula for updating β̂, and then update the stored matrix (X'X)⁻¹. This
strategy requires storage of only (X'X)⁻¹ in addition to β̂.

Instead of updating β̂ with every new example, one can also choose to update β̂ after every
batch of m examples. In that case β̂ can be updated by applying the rank-one update formula
m times, adding one example at a time.
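A minimal NumPy sketch of formulas (1) and (2) (the function name is illustrative; `x` is assumed to include the leading 1 for the intercept):

```python
import numpy as np

def rank_one_update(XtX_inv, beta_hat, x, y):
    """Update beta_hat and (X'X)^{-1} when one new example (x, y) arrives.

    XtX_inv is the stored (X'X)^{-1}, x is the new feature vector
    (including the leading 1), and y is the new observed response.
    """
    Ax = XtX_inv @ x                     # (X'X)^{-1} x
    denom = 1.0 + x @ Ax                 # 1 + x'(X'X)^{-1} x
    y_hat = x @ beta_hat                 # predicted response for the new case
    beta_new = beta_hat + Ax * (y - y_hat) / denom       # formula (2)
    XtX_inv_new = XtX_inv - np.outer(Ax, Ax) / denom     # formula (1)
    return XtX_inv_new, beta_new
```

Applying this m times, one example at a time, reproduces the batch update described above; no matrix inversion is ever needed after the initial fit.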

Deciding importance of features


The F-test, performed on the basis of the ANOVA table for the learned regression model, tests
whether the explanation provided by the learnt model is significant.
It must be understood that a significant F statistic does not imply suitability of the model for
prediction purposes. A value of the R² statistic close to 1, however, indicates usefulness of the
model for prediction purposes.
Even if a regression model is found useful, not all the features used in the model may be important.
It is therefore necessary to identify a subset of features that can provide almost the same
amount of explanation as the full set of features. Alternatively, we may have a large set of
candidate features which can possibly influence the output/response that we wish to model.
This task is called Feature Selection.
An associated problem is when the inputs are not directly usable as features for developing the
learning models. In such situations one must first identify the features for the available inputs.
This task is called Feature Extraction.
Since the tasks of ‘feature selection’ and ‘feature extraction’ are important for every supervised
method, we will return to these tasks after covering other supervised methods.

Gradient descent Algorithm


Descending on a gradient literally means moving downwards on a slope.
Suppose f ( x ) is a function of x that we want to minimize with respect to x.

Let x 0 be the initial guess of the optimum solution.

Then we know that f ' ( x 0 ) is the rate at which value of f changes with x at x = x 0.

In particular, for a small step h,

f(x_0 + h) ≈ f(x_0) + h f'(x_0)

If f ' ( x 0 ) >0 , f increases, whereas

If f ' ( x 0 ) <0 , f decreases.

Thus, f ' ( x 0 ) denotes the rate and the direction of change.

Since our objective is to minimize f, we take a small step in the direction opposite to gradient. In
other words, for positive gradient, we should decrease x, and for negative gradient, we should
increase x.

In practice we choose the step h to be −γ f'(x_0), where γ is called the learning rate. This
choice of step ensures that when f'(x_0) has a larger magnitude (i.e. x_0 is away from the
optimal value), the step is bigger; and when f'(x_0) has a smaller magnitude (i.e. x_0 is close to
the optimal value), the step is smaller, so that we reach the optimal value with good
accuracy.
We keep changing the value of x in the above manner until the minimum is reached (equivalently,
the gradient becomes 0).
Remark: Although the above explanation is for a univariate function, the rationale holds even when
f is a multivariate function. More precisely, f'(x_0) is replaced with the vector of partial
derivatives, called the gradient:

( ∂f(x_0)/∂x_1 , ∂f(x_0)/∂x_2 , … , ∂f(x_0)/∂x_n )
The algorithm
1. Take an initial guess x_0.
2. Compute the gradient at x = x_0.
3. Set x_0 = x_0 − learning rate × gradient.
4. If the magnitude of the gradient is very small, output x_0; else go to step 2.
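The steps above can be sketched in Python for the univariate case (the example function, learning rate, and tolerance are illustrative assumptions):

```python
def gradient_descent_1d(f_prime, x0, learning_rate=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a univariate function, given its derivative f_prime."""
    x = x0
    for _ in range(max_iter):
        g = f_prime(x)                 # step 2: gradient (derivative) at the current point
        if abs(g) < tol:               # step 4: stop when the gradient is tiny
            break
        x = x - learning_rate * g      # step 3: move against the gradient
    return x

# Example: f(x) = (x - 3)^2 has its minimum at x = 3, with f'(x) = 2(x - 3).
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
```

With this learning rate the error shrinks by a constant factor each iteration, illustrating why steps automatically get smaller as x approaches the optimum.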

Gradient descent Algorithm for multiple regression


Consider the multiple regression model:

Ŷ = f(x; β) = β_0 + β_1 x_1 + … + β_p x_p

Suppose Y is the true value and Ŷ is the predicted value of the outcome variable. Then the
squared error loss function is given by

L(Y, Ŷ) = (Y − Ŷ)²

Since Y is a random variable, our objective is to minimize E(L(Y, Ŷ)). When we have training
data, this expectation is estimated by

L̂ = (1/n) Σ_{i=1}^n (Y_i − Ŷ_i)² = (1/n) Σ_{i=1}^n (Y_i − f(x_i; β))²

We then minimize the above estimated quantity with respect to the parameters β.
The gradient vector is given by
∂L̂/∂β = ∂/∂β [ (1/n) Σ_{i=1}^n (Y_i − f(x_i; β))² ] = (2/n) Σ_{i=1}^n (Y_i − f(x_i; β)) ∂/∂β (Y_i − f(x_i; β))

       = (2/n) Σ_{i=1}^n (Y_i − f(x_i; β)) ∂/∂β (Y_i − β_0 − β_1 x_{i,1} − … − β_p x_{i,p})

       = (2/n) Σ_{i=1}^n (Y_i − f(x_i; β)) (−1, −x_{i,1}, …, −x_{i,p})'

The gradient descent algorithm

1. Take an initial guess β_0.
2. Compute the gradient at β = β_0, given by G = (2/n) Σ_{i=1}^n (Y_i − f(x_i; β_0)) (−1, −x_{i,1}, …, −x_{i,p})'.
3. Set β_0 = β_0 − γ G, where γ is the learning rate.
4. If the norm of the gradient is very small, output β_0; else go to step 2.

This iterative algorithm obtains the same LS estimator of β as discussed earlier.
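A NumPy sketch of this batch algorithm on toy data (the data, learning rate, and tolerance are illustrative assumptions), confirming that the iterates converge to the closed-form LS estimator:

```python
import numpy as np

def batch_gradient_descent(X, Y, learning_rate=0.01, tol=1e-8, max_iter=100_000):
    """Fit multiple regression by gradient descent on the mean squared error.

    X includes a leading column of 1s for the intercept.  The gradient is
    G = (2/n) * sum_i (Y_i - f(x_i; beta)) * (-x_i), as derived above.
    """
    n = X.shape[0]
    beta = np.zeros(X.shape[1])             # step 1: initial guess
    for _ in range(max_iter):
        residuals = Y - X @ beta            # Y_i - f(x_i; beta)
        G = -(2.0 / n) * (X.T @ residuals)  # step 2: gradient vector
        if np.linalg.norm(G) < tol:         # step 4: stopping rule
            break
        beta = beta - learning_rate * G     # step 3: update
    return beta

# Toy data: compare the GD iterates with the closed-form LS solution.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)
beta_gd = batch_gradient_descent(X, Y)
beta_ls = np.linalg.lstsq(X, Y, rcond=None)[0]   # direct formula, for comparison
```

Note that the learning rate must be small enough relative to the scale of X'X for the iteration to converge; standardizing the features is the usual remedy.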


Q: Why use an iterative algorithm when a direct formula is available in closed form?

Stochastic Gradient Descent (SGD) Algorithm


Even though the iterative formula has the advantage of achieving more accuracy than the direct
formula, in practice it involves a difficulty. Since the volume of training data is
supposed to be large for achieving better training of the model, the computations involved in
the iterative algorithm are prohibitively large.

Instead of estimating E(L(Y, Ŷ)) based on the entire training data, we can take a random
sample from the training data and estimate this expected loss using that sample. Also,
we draw a fresh random sample in every iteration.
Let S denote the training data. The gradient descent algorithm can be expressed in terms
of S as
Algorithm 1

1. Set j = 0. Take an initial guess β^(0) = β_0.
2. Compute the gradient at β = β^(j), given by G^(j) = (2/|S|) Σ_S (Y_i − f(x_i; β^(j))) (−1, −x_{i,1}, …, −x_{i,p})'.
3. Set β^(j+1) = β^(j) − γ G^(j), where γ is the learning rate.
4. If the norm of the gradient is very small, output β^(j+1); else set j = j + 1 and go to step 2.

Here Σ_S denotes the summation taken over the entire training dataset S, and |S| denotes the
cardinality of S. Since we are using the entire training data here, this algorithm is more precisely
referred to as the batch gradient descent algorithm.
The stochastic gradient algorithm given below is instead based on the random samples s j.
Algorithm 2

1. Set j = 0. Take an initial guess β^(0) = β_0.
2. (a) Draw a random sample s_j from the training dataset S.
   (b) Compute the gradient at β = β^(j), given by G^(j) = (2/|s_j|) Σ_{s_j} (Y_i − f(x_i; β^(j))) (−1, −x_{i,1}, …, −x_{i,p})'.
3. Set β^(j+1) = β^(j) − γ G^(j), where γ is the learning rate.
4. If the norm of the gradient is very small, output β^(j+1); else set j = j + 1 and go to step 2.
In practice, the SGD algorithm is implemented by drawing random samples s_j of size 1, which
greatly reduces the computational effort involved in the iterative approach. This specific form of
the algorithm can be presented as

The SGD Algorithm

1. Set j = 0. Take an initial guess β^(0) = β_0.
2. (a) Draw a random observation (x_j, Y_j) from the training dataset S.
   (b) Compute the gradient at β = β^(j), given by G^(j) = 2 (Y_j − f(x_j; β^(j))) (−1, −x_{j,1}, …, −x_{j,p})'.
3. Set β^(j+1) = β^(j) − γ G^(j), where γ is the learning rate.
4. If the norm of the gradient is very small, output β^(j+1); else set j = j + 1 and go to step 2.
The most significant benefit of this algorithm is that, with every new example received from the
field, it can further improve the parameter estimates and thereby improve performance. Thus, the
algorithm makes continuous learning feasible.
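A minimal NumPy sketch of this single-example SGD (the data, epoch count, and learning rate are illustrative assumptions, not prescribed by the notes):

```python
import numpy as np

def sgd_regression(X, Y, learning_rate=0.01, n_epochs=200, seed=0):
    """Stochastic gradient descent with samples of size 1.

    X includes a leading column of 1s.  For each randomly drawn example
    (x_j, Y_j), the gradient is G = 2 * (Y_j - f(x_j; beta)) * (-x_j),
    and beta is updated as beta - learning_rate * G.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    beta = np.zeros(X.shape[1])               # initial guess
    for _ in range(n_epochs):
        for j in rng.permutation(n):          # one random example at a time
            residual = Y[j] - X[j] @ beta     # Y_j - f(x_j; beta)
            G = -2.0 * residual * X[j]        # single-example gradient
            beta = beta - learning_rate * G
    return beta
```

With a fixed learning rate and noisy data, the iterates hover near the LS solution rather than settling on it exactly; decreasing the learning rate over time (or averaging the iterates) is the standard way to obtain convergence.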
