Optimization
Intro to Deep Learning, Fall 2020
Recap
• Neural networks are universal approximators
• We must train them to approximate any function
• Networks are trained to minimize total “error” on a training set
  – We do so through empirical risk minimization
• We use variants of gradient descent to do so
  – Gradients are computed through backpropagation
The training formulation
output (y)
Input (X)
22
Gradient descent
Effect of number of samples
Alternative: Incremental update
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do: (over multiple epochs)
  – For all t = 1:T (one epoch)
    • For every layer k:
      – Compute ∇_{W_k} Div(Y_t, d_t)
      – Update W_k = W_k − η ∇_{W_k} Div(Y_t, d_t)ᵀ (one update)
• Until the error has converged
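Below is a minimal numpy sketch of this per-instance loop for a single linear layer with a squared-error divergence; the toy data, layer shape, and learning rate are illustrative assumptions, not from the slides.

import numpy as np

# Toy training set: T pairs (X_t, d_t) for a 5-input, 1-output linear layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
d = rng.normal(size=(100, 1))

W = np.zeros((5, 1))                       # single layer: Y = X W
eta = 0.01                                 # learning rate

for epoch in range(20):                    # Do: over multiple epochs
    for t in range(X.shape[0]):            # For all t = 1:T (one epoch)
        x_t, d_t = X[t:t+1], d[t:t+1]
        y_t = x_t @ W                      # forward pass
        grad = x_t.T @ (y_t - d_t)         # ∇_W Div for Div = ½‖y − d‖²
        W -= eta * grad                    # one update per training instance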
Caveats: order of presentation
Batch SGD
[Figure: a network mapping input (X) to output (y)]
• Except in the case of a perfect fit, even an optimal overall fit will look incorrect to individual instances
  – Correcting the function for individual instances will lead to never-ending, non-convergent updates
  – We must shrink the learning rate with iterations to prevent this
• Correction for individual instances with the eventually minuscule learning rates will not modify the function
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
  – Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
  – For all t = 1:T
    • j = j + 1
    • For every layer k:
      – Compute ∇_{W_k} Div(Y_t, d_t)
      – Update W_k = W_k − η_j ∇_{W_k} Div(Y_t, d_t)ᵀ
• Until the error has converged
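A minimal numpy sketch of this SGD loop, with a fresh random permutation each epoch and a step size η_j that shrinks with the update count; the 1/j decay schedule and the toy linear model are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
d = rng.normal(size=(100, 1))
W = np.zeros((5, 1))

eta0, j = 0.05, 0                          # base step size and update counter
for epoch in range(20):
    order = rng.permutation(X.shape[0])    # randomly permute the training set
    for t in order:
        j += 1
        eta_j = eta0 / j                   # shrinking step size (assumed 1/j schedule)
        x_t, d_t = X[t:t+1], d[t:t+1]
        grad = x_t.T @ (x_t @ W - d_t)     # ∇_W Div for squared-error divergence
        W -= eta_j * grad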
Explaining the variance
Alternative: Mini-batch update
Incremental Update: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
  – Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
  – For t = 1:b:T (b is the mini-batch size)
    • j = j + 1
    • For every layer k:
      – ΔW_k = 0
    • For t′ = t : t+b−1
      – For every layer k:
        » Compute ∇_{W_k} Div(Y_{t′}, d_{t′})
        » ΔW_k = ΔW_k + ∇_{W_k} Div(Y_{t′}, d_{t′})
    • Update
      – For every layer k:
        W_k = W_k − η_j ΔW_k (η_j: shrinking step size)
• Until the error has converged
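A minimal numpy sketch of this mini-batch loop: gradients are accumulated over b instances (ΔW), and the weights are updated once per mini-batch; the batch size, model, and 1/j step-size schedule are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
d = rng.normal(size=(100, 1))
W = np.zeros((5, 1))

b, eta0, j = 10, 0.05, 0                   # mini-batch size, base step size, update counter
for epoch in range(20):
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], b):  # For t = 1:b:T
        j += 1
        dW = np.zeros_like(W)              # ΔW = 0
        for t in order[start:start + b]:   # For t′ = t : t+b−1
            x_t, d_t = X[t:t+1], d[t:t+1]
            dW += x_t.T @ (x_t @ W - d_t)  # accumulate ∇_W Div
        W -= (eta0 / j) * dW / b           # one update per mini-batch (mean gradient used here)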
Mini Batches
[Figure: mini-batches drawn from training pairs (X_i, d_i)]
• The expected value of the batch loss is also the expected divergence
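To see why (assuming the b instances in a batch are drawn i.i.d. from the data distribution), linearity of expectation gives

  E[ (1/b) Σ_{t=1..b} Div(Y_t, d_t) ] = (1/b) Σ_{t=1..b} E[ Div(Y_t, d_t) ] = E[ Div(Y, d) ]

so the batch loss is an unbiased estimate of the expected divergence.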
SGD example
Momentum and incremental updates
• The momentum method, applied to the SGD instance or minibatch loss:
  » ∇_{W_k} Loss += ∇_{W_k} Div(Y_t, d_t)
• Update
  – For every layer k:
    ΔW_k = β ΔW_k − η (∇_{W_k} Loss)ᵀ
    W_k = W_k + ΔW_k
• Until the error has converged
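A minimal numpy sketch of momentum over mini-batches: the velocity term ΔW blends the previous update (weight β, assumed 0.9 here) with the current minibatch gradient; the toy model and batch size are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
d = rng.normal(size=(100, 1))
W = np.zeros((5, 1))
dW = np.zeros_like(W)                      # momentum / velocity term ΔW

b, eta, beta = 10, 0.05, 0.9
for epoch in range(20):
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], b):
        grad = np.zeros_like(W)            # ∇_W Loss = 0
        for t in order[start:start + b]:
            x_t, d_t = X[t:t+1], d[t:t+1]
            grad += x_t.T @ (x_t @ W - d_t)   # ∇_W Loss += ∇_W Div
        dW = beta * dW - eta * grad / b    # ΔW = βΔW − η ∇_W Loss (mean gradient used here)
        W = W + dW                         # W = W + ΔW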
Nesterov’s Accelerated Gradient
• Nesterov’s method computes the SGD instance or minibatch loss gradient at the “looked-ahead” position:
  ΔW_k = β ΔW_k − η ∇_{W_k} Loss(W_k + β ΔW_k)
  W_k = W_k + ΔW_k
Nesterov: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0 for all k
• Do:
  – Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
  – For t = 1:b:T
    • j = j + 1
    • For every layer k:
      – W_k = W_k + β ΔW_k
      – ∇_{W_k} Loss = 0
    • For t′ = t : t+b−1
      – For every layer k:
        » Compute ∇_{W_k} Div(Y_{t′}, d_{t′})
        » ∇_{W_k} Loss += ∇_{W_k} Div(Y_{t′}, d_{t′})
    • Update
      – For every layer k:
        W_k = W_k − η_j (∇_{W_k} Loss)ᵀ
        ΔW_k = β ΔW_k − η_j (∇_{W_k} Loss)ᵀ
• Until the error has converged
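A minimal numpy sketch of one Nesterov step per the pseudocode above, written for a single weight matrix; grad_fn is an illustrative callable that returns the minibatch loss gradient at any given weights.

import numpy as np

def nesterov_step(W, dW, grad_fn, eta=0.05, beta=0.9):
    # One Nesterov update: look ahead, take the gradient there, then step.
    W_look = W + beta * dW                 # W_k = W_k + βΔW_k (look-ahead point)
    g = grad_fn(W_look)                    # ∇_W Loss evaluated at the look-ahead point
    W_new = W_look - eta * g               # W_k = W_k − η ∇_W Loss
    dW_new = beta * dW - eta * g           # ΔW_k = βΔW_k − η ∇_W Loss
    return W_new, dW_new

# Example with a toy quadratic loss ½‖XW − d‖² (illustrative data):
rng = np.random.default_rng(0)
X, d = rng.normal(size=(32, 5)), rng.normal(size=(32, 1))
W, dW = np.zeros((5, 1)), np.zeros((5, 1))
for _ in range(100):
    W, dW = nesterov_step(W, dW, lambda V: X.T @ (X @ V - d) / len(X))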
RMS Prop
• Procedure:
  – Maintain a running estimate of the mean squared value of derivatives for each parameter
  – Scale the update of the parameter by the inverse of the root mean squared derivative
RMS Prop (updates are for each weight of each layer)
• Do:
  – Randomly shuffle inputs to change their order
  – Initialize: k = 1; for all weights w in all layers, E[(∂_w D)²]_0 = 0
  – For all t (incrementing in blocks of B inputs)
    • For all weights in all layers initialize (∂_w D)_k = 0
    • For b = 0 : B−1
      – Compute
        » Output Y(X_{t+b})
        » Gradient dDiv(Y(X_{t+b}), d_{t+b}) / dw
        » (∂_w D)_k += dDiv(Y(X_{t+b}), d_{t+b}) / dw
    • Update for every weight w:
      E[(∂_w D)²]_k = γ E[(∂_w D)²]_{k−1} + (1 − γ) ((∂_w D)_k)²
      w = w − (η / √(E[(∂_w D)²]_k + ε)) (∂_w D)_k
    • k = k + 1
• Until loss has converged
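A minimal per-weight RMSprop step matching the procedure above: a running mean of squared derivatives scales each update; the values of γ, ε, and η are assumed typical defaults, not taken from the slides.

import numpy as np

def rmsprop_step(w, grad, mean_sq, eta=0.001, gamma=0.9, eps=1e-8):
    # mean_sq is the running estimate E[(∂D/∂w)²] for this parameter array.
    mean_sq = gamma * mean_sq + (1.0 - gamma) * grad**2   # update running mean square
    w = w - eta * grad / np.sqrt(mean_sq + eps)           # scale step by 1 / RMS derivative
    return w, mean_sq

# Usage: keep one mean_sq accumulator per parameter array across mini-batches.
w, mean_sq = np.zeros(5), np.zeros(5)
grad = np.array([0.1, -0.2, 0.05, 0.0, 0.3])              # an illustrative minibatch gradient
w, mean_sq = rmsprop_step(w, grad, mean_sq)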
ADAM: RMSprop with momentum
• RMSprop only considers a second-moment normalized version of the current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient
  – Considers both first and second moments
• Procedure:
  – Maintain a running estimate of the mean derivative (m) for each parameter
  – Maintain a running estimate of the mean squared value of derivatives (v) for each parameter
  – Scale the update of the parameter by the inverse of the root mean squared derivative

    m̂_k = m_k / (1 − δᵏ),  v̂_k = v_k / (1 − γᵏ)
    w_{k+1} = w_k − (η / (√v̂_k + ε)) m̂_k

  – Dividing by (1 − δᵏ) and (1 − γᵏ) ensures that the δ and γ terms do not dominate in early iterations
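A minimal per-parameter Adam step consistent with the procedure above: running mean (m) and mean square (v) of the derivative, bias corrections 1/(1 − δᵏ) and 1/(1 − γᵏ), and an RMS-scaled update. The running-average recursions and the values of δ, γ, ε, η are the usual Adam defaults, stated here as assumptions.

import numpy as np

def adam_step(w, grad, m, v, k, eta=0.001, delta=0.9, gamma=0.999, eps=1e-8):
    # One Adam update at step k (k starts at 1); m, v are per-parameter accumulators.
    m = delta * m + (1.0 - delta) * grad              # running mean of the derivative
    v = gamma * v + (1.0 - gamma) * grad**2           # running mean of the squared derivative
    m_hat = m / (1.0 - delta**k)                      # bias-corrected first moment
    v_hat = v / (1.0 - gamma**k)                      # bias-corrected second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)      # RMS-scaled, momentum-smoothed step
    return w, m, v

# Usage: keep m, v, and the step counter k alongside each parameter array.
w, m, v = np.zeros(5), np.zeros(5), np.zeros(5)
for k in range(1, 101):
    grad = 2.0 * (w - 1.0)                            # gradient of a toy loss ‖w − 1‖²
    w, m, v = adam_step(w, grad, m, v, k)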
Other variants of the same theme
• Many:
  – Adagrad
  – AdaDelta
  – AdaMax
  – …
• Generally no explicit learning rate to optimize
  – But they come with other hyperparameters to be optimized
  – Typical params:
    • RMSProp: γ (e.g., 0.9), ε (e.g., 10⁻⁸)
    • ADAM: δ (e.g., 0.9), γ (e.g., 0.999), ε (e.g., 10⁻⁸)
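For reference, a sketch of how such typical values map onto a widely used implementation (PyTorch’s torch.optim, used purely as an illustration; the linear model is a placeholder):

import torch

model = torch.nn.Linear(5, 1)                          # placeholder model

# RMSprop: alpha plays the role of γ (squared-derivative smoothing factor).
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9, eps=1e-8)

# Adam: betas = (δ, γ), the first- and second-moment smoothing factors.
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)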
Visualizing the optimizers: Beale’s Function, Long Valley, Saddle Point
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Story so far
• Gradient descent can be sped up by incremental updates
  – Convergence is guaranteed under most conditions
    • Learning rate must shrink with time for convergence
  – Stochastic gradient descent: update after each observation. Can be much faster than batch learning
  – Mini-batch updates: update after batches. Can be more efficient than SGD