Week 5 Optimisation

Introduction to Applied Machine Learning

IAML: Optimization

Nigel Goddard
School of Informatics

Semester 1

1 / 24
Outline

I Why we use optimization in machine learning
I The general optimization problem
I Gradient descent
I Problems with gradient descent
I Batch versus online
I Second-order methods
I Constrained optimization

Many illustrations, text, and general ideas in these slides are taken from Sam Roweis (1972–2010).

2 / 24
Why Optimization

I A main idea in machine learning is to convert the learning
  problem into a continuous optimization problem.
I Examples: Linear regression, logistic regression (we have
  seen), neural networks, SVMs (we will see these later)
I One way to do this is maximum likelihood

$$\ell(w) = \log p(y_1, x_1, y_2, x_2, \ldots, y_n, x_n \mid w) = \log \prod_{i=1}^{n} p(y_i, x_i \mid w) = \sum_{i=1}^{n} \log p(y_i, x_i \mid w)$$

I Example: Linear regression

3 / 24
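To make the maximum likelihood recipe concrete, here is a minimal sketch (not from the slides; the data and names are illustrative) showing that for linear regression with Gaussian noise of known variance, the negative log likelihood is just a rescaled squared-error function, so both objectives have the same minimizer:

```python
import numpy as np

# Illustrative sketch: Gaussian-noise linear regression, where
# maximizing the log likelihood == minimizing squared error.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 instances, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)  # targets with Gaussian noise

def neg_log_likelihood(w, X, y, sigma=0.1):
    """Negative Gaussian log likelihood, dropping w-independent constants."""
    residuals = y - X @ w
    return np.sum(residuals ** 2) / (2 * sigma ** 2)

def squared_error(w, X, y):
    return np.sum((y - X @ w) ** 2)

# The two objectives differ only by the positive constant 1/(2*sigma^2),
# so they have exactly the same minimizer w*.
w = rng.normal(size=3)
print(neg_log_likelihood(w, X, y) / squared_error(w, X, y))  # 50.0 for any w
```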
Error Surfaces and Weight Space

I End result: an “error function” E(w) which we want to
  minimize.
I E(w) can be, e.g., the negative log likelihood or the
  negative log posterior.
I Consider a fixed training set; think in weight (not input)
  space. At each setting of the weights there is some error
  (given the fixed training set): this defines an error surface
  in weight space.
I Learning == descending the error surface.
I If the data are IID, the error function E is a sum of error
  functions Ei, one per data point.

[Figure: error surfaces E(w) in one dimension (over w) and in
two dimensions (over weights wi and wj).]
4 / 24
Role of Smoothness
If E is completely unconstrained, minimization is impossible.

[Figure: an arbitrary, erratic error surface E(w) over w.]

All we could do is search through all possible values of w.

Key idea: If E is continuous, then measuring E(w) gives
information about E at many nearby values.
5 / 24
Role of Derivatives

I If we wiggle wk and keep everything else the same, does
  the error get better or worse?
I Calculus has an answer to exactly this question: ∂E/∂wk
I So: use a differentiable cost function E and compute the
  partial derivative with respect to each parameter.
I The vector of partial derivatives is called the gradient of
  the error. It is written ∇E = (∂E/∂w1, ∂E/∂w2, ..., ∂E/∂wn).
  Alternate notation: ∂E/∂w.
I It points in the direction of steepest error ascent in weight
  space, so its negation −∇E is the direction of steepest
  descent.
I Three crucial questions:
I How do we compute the gradient ∇E efficiently?
I Once we have the gradient, how do we minimize the error?
I Where will we end up in weight space?

6 / 24
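As a concrete illustration of “wiggling wk” (a sketch; the example function and all names are invented for the demo), the partial derivatives can be approximated by central finite differences and checked against an analytic gradient:

```python
import numpy as np

# Illustrative sketch: approximate the gradient of E at w by wiggling
# each coordinate wk in turn (central finite differences). A standard
# way to sanity-check an analytic gradient before optimizing.

def E(w):
    # An example error function: E(w) = w0^2 + 10*w1^2.
    return w[0] ** 2 + 10 * w[1] ** 2

def grad_E(w):
    # Its analytic gradient.
    return np.array([2 * w[0], 20 * w[1]])

def numerical_gradient(f, w, eps=1e-6):
    g = np.zeros_like(w)
    for k in range(len(w)):
        d = np.zeros_like(w)
        d[k] = eps
        g[k] = (f(w + d) - f(w - d)) / (2 * eps)  # wiggle wk only
    return g

w = np.array([1.0, -0.5])
print(grad_E(w))                 # [  2. -10.]
print(numerical_gradient(E, w))  # ≈ [  2. -10.]
```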
Numerical Optimization Algorithms

I Numerical optimization algorithms try to solve the general
  problem

      $$\min_{w} E(w)$$

I Most commonly, a numerical optimization procedure takes
  two inputs:
  I A procedure that computes E(w)
  I A procedure that computes the partial derivatives ∂E/∂wj
I (Aside: Some algorithms use less information, i.e., they don’t
  use gradients. Some use more information, i.e., higher-order
  derivatives. We won’t go into these algorithms in the
  course.)

7 / 24
Optimization Algorithm Cartoon
I Basically, numerical optimization algorithms are iterative.
  They generate a sequence of iterates

      w0, w1, w2, ...

  along the way computing the values E(w0), E(w1), E(w2), ...
  and the gradients ∇E(w0), ∇E(w1), ∇E(w2), ...
I The basic optimization algorithm is:

      initialize w
      while E(w) is unacceptably high
          calculate g = ∇E
          compute direction d from w, E(w), g
          (can use previous gradients as well...)
          w ← w − η d
      end while
      return w
8 / 24
A Choice of Direction

I The simplest choice of d is the current gradient ∇E.
I Stepping along −d = −∇E is locally the steepest descent
  direction.
I (Technically, the reason for this choice is Taylor’s theorem
  from calculus.)

9 / 24
Gradient Descent

I Simple gradient descent algorithm:

      initialize w
      while E(w) is unacceptably high
          calculate g ← ∂E/∂w
          w ← w − η g
      end while
      return w
I η is known as the step size (sometimes called learning
rate)
I We must choose η > 0.
I η too small → too slow
I η too large → instability

10 / 24
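A minimal Python sketch of this loop (the stopping rule `tol`/`max_iters` is an illustrative choice; the slide leaves “unacceptably high” unspecified):

```python
import numpy as np

# Minimal sketch of the gradient descent loop above.

def gradient_descent(E, grad_E, w0, eta=0.1, tol=1e-8, max_iters=10_000):
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_E(w)                   # g ← ∂E/∂w
        if np.linalg.norm(g) < tol:     # stand-in for "E(w) acceptably low"
            break
        w = w - eta * g                 # w ← w − η g
    return w

# Example: minimize E(w) = w² starting from w0 = 1.0.
print(gradient_descent(lambda w: w**2, lambda w: 2*w, w0=1.0))  # ≈ 0
```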
Effect of Step Size

Goal: Minimize E(w) = w²

I Take η = 0.1. Works well.

      w0 = 1.0
      w1 = w0 − 0.1 · 2w0 = 0.8
      w2 = w1 − 0.1 · 2w1 = 0.64
      w3 = w2 − 0.1 · 2w2 = 0.512
      ···
      w25 = 0.0047

[Figure: the parabola E(w) = w² for w ∈ [−3, 3].]
11 / 24
Effect of Step Size

Goal: Minimize E(w) = w²

I Take η = 1.1. Not so good. If you step too far, you can leap
  over the region that contains the minimum.

      w0 = 1.0
      w1 = w0 − 1.1 · 2w0 = −1.2
      w2 = w1 − 1.1 · 2w1 = 1.44
      w3 = w2 − 1.1 · 2w2 = −1.72
      ···
      w25 = 79.50

[Figure: the parabola E(w) = w² for w ∈ [−3, 3].]

I Finally, take η = 0.000001. What happens here?
12 / 24
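A quick numerical check of the three step sizes (illustrative; only the qualitative behaviour matters, since the exact iterate values depend on how the steps are counted):

```python
# Check the three step sizes on E(w) = w², whose gradient is 2w.
for eta in (0.1, 1.1, 1e-6):
    w = 1.0
    for _ in range(25):
        w = w - eta * 2 * w        # w ← w − η ∇E(w)
    print(eta, w)
# eta = 0.1:  w shrinks geometrically toward 0     (works well)
# eta = 1.1:  |w| grows as it oscillates across 0  (diverges)
# eta = 1e-6: w ≈ 0.99995, i.e. we barely moved    (far too slow)
```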
“Bold Driver” Gradient Descent

I A simple heuristic for choosing η which you can use if you’re
  desperate:

      initialize w, η
      e ← E(w); g ← ∇E(w)
      while η > 0
          w1 ← w − η g
          e1 ← E(w1); g1 ← ∇E(w1)
          if e1 ≥ e
              η ← η/2
          else
              η ← 1.01 η; w ← w1; g ← g1; e ← e1
          end if
      end while
      return w
I Finds a local minimum of E.

13 / 24
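A sketch of the bold driver heuristic in Python (illustrative; the `min_eta` threshold replaces the slide’s “while η > 0”, which in exact arithmetic would never terminate):

```python
import numpy as np

# Sketch of the "bold driver" heuristic above.

def bold_driver(E, grad_E, w0, eta=0.1, min_eta=1e-12):
    w = np.asarray(w0, dtype=float)
    e, g = E(w), grad_E(w)
    while eta > min_eta:
        w1 = w - eta * g
        e1, g1 = E(w1), grad_E(w1)
        if e1 >= e:
            eta /= 2            # step made things worse: retract, halve η
        else:
            eta *= 1.01         # step helped: accept it, grow η slightly
            w, g, e = w1, g1, e1
    return w

w_star = bold_driver(lambda w: (w - 3)**2, lambda w: 2*(w - 3), w0=0.0)
print(w_star)   # ≈ 3, a local (here also global) minimum
```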
Batch vs online
I So far, all the objective functions we have seen look like

      $$E(w; D) = \sum_{i=1}^{n} E_i(w; y_i, x_i)$$

  where D = {(x1, y1), (x2, y2), ..., (xn, yn)} is the training set.
I Each term in the sum depends on only one training instance.
I Example: Logistic regression: Ei(w; yi, xi) = −log p(yi | xi, w).
I The gradient in this case is always

      $$\frac{\partial E}{\partial w} = \sum_{i=1}^{n} \frac{\partial E_i}{\partial w}$$
I The algorithm on slide 10 scans all the training instances
before changing the parameters.
I Seems dumb if we have millions of training instances.
Surely we can get a gradient that is “good enough” from
fewer instances, e.g., a couple of thousand? Or maybe
  even from just one?

14 / 24
Batch vs online

I Batch learning: use all patterns in the training set, and
  update the weights after calculating

      $$\frac{\partial E}{\partial w} = \sum_{i} \frac{\partial E_i}{\partial w}$$

I On-line learning: adapt the weights after each pattern
  presentation, using ∂Ei/∂w
I Batch: allows more powerful optimization methods
I Batch: easier to analyze
I On-line: more feasible for huge or continually growing
  datasets
I On-line: may have the ability to jump over local optima

15 / 24
Algorithms for Batch Gradient Descent

I Here is batch gradient descent:

      initialize w
      while E(w) is unacceptably high
          calculate g ← Σ_{i=1}^{N} ∂Ei/∂w
          w ← w − η g
      end while
      return w

I This is just the algorithm we have seen before. We have just
  “substituted in” the fact that E = Σ_{i=1}^{N} Ei.

16 / 24
Algorithms for Online Gradient Descent

I Here is (a particular type of) online gradient descent
  algorithm:

      initialize w
      while E(w) is unacceptably high
          pick j as a uniform random integer in 1 . . . N
          calculate g ← ∂Ej/∂w
          w ← w − η g
      end while
      return w

I This version is also called “stochastic gradient descent”
  because we pick the training instance randomly.
I There are other variants of online gradient descent.

17 / 24
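A sketch of stochastic gradient descent for a linear model with per-instance squared error Ei = (yi − w·xi)² (the data and names are invented for the demo, not from the slides):

```python
import numpy as np

# Illustrative sketch of stochastic gradient descent.
# grad_Ei returns the gradient of one instance's error Ei = (y - w·x)².

def grad_Ei(w, x, y):
    return -2 * (y - w @ x) * x

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
eta = 0.01
for _ in range(20_000):
    j = rng.integers(len(X))              # pick j uniformly from 1..N
    w = w - eta * grad_Ei(w, X[j], y[j])  # update from a single instance
print(w)   # ≈ [ 1.  -2.   0.5]
```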
Problems With Gradient Descent

I Setting the step size η
I Shallow valleys
I Highly curved error surfaces
I Local minima

18 / 24
Shallow Valleys

I Typical gradient descent can be fooled in several ways, which
  is why more sophisticated methods are used when possible.
I One problem: if the error surface is a long and narrow valley,
  gradient descent goes quickly down the valley walls but very
  slowly along the valley bottom.

[Figure: zig-zag gradient descent path bouncing between the
walls of a long, narrow valley.]

I Gradient descent goes very slowly once it hits the shallow
  valley.
I One hack to deal with this is momentum: update the
  parameters using a combination of the previous update and
  the gradient update

      $$d_t = \beta\, d_{t-1} + (1 - \beta)\, \eta\, \nabla E(w_t), \qquad w_{t+1} = w_t - d_t$$

I Usually β is quite high, about 0.95. Physically, this is like
  giving momentum to our weights.
I Now you have to set both η and β. This can be difficult and
  irritating.
19 / 24
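A sketch of the momentum update on a long, narrow quadratic valley (illustrative; the valley E(w) = w0² + 100·w1² and all constants are invented for the demo):

```python
import numpy as np

# Sketch of gradient descent with momentum, as on the slide.
# E(w) = w0² + 100·w1² is a narrow valley: steep in w1, flat in w0.

def grad_E(w):
    return np.array([2 * w[0], 200 * w[1]])

eta, beta = 0.009, 0.95
w = np.array([1.0, 1.0])
d = np.zeros(2)
for _ in range(500):
    d = beta * d + (1 - beta) * eta * grad_E(w)  # d_t = β d_{t−1} + (1−β) η ∇E(w_t)
    w = w - d                                    # w_{t+1} = w_t − d_t
print(w)   # ≈ [0, 0]: momentum damps the side-to-side bouncing
```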
Curved Error Surfaces

I A second problem with gradient descent is that the gradient
  might not point towards the optimum. This is because of
  curvature: the error surface may be curved differently in
  different directions.
I Note: the gradient is the locally steepest direction. It need
  not point directly toward the nearest local minimum.

[Figure: elliptical error contours; the gradient dE/dW at a point
is perpendicular to the contour and does not point at the
minimum.]

I The local geometry of curvature is measured by the Hessian
  matrix of second derivatives:

      $$H_{ij} = \frac{\partial^2 E}{\partial w_i \, \partial w_j}$$

I Eigenvectors/eigenvalues of the Hessian describe the
  directions of principal curvature and the amount of curvature
  in each direction. Near a local minimum, the Hessian is
  positive definite.
I The maximum sensible step size is 2/λmax, and the rate of
  convergence depends on (1 − 2 λmin/λmax).
20 / 24
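An illustrative check of the step-size claim (the quadratic E and its Hessian are invented for the demo): gradient descent on a quadratic diverges once η exceeds 2/λmax.

```python
import numpy as np

# For the quadratic E(w) = w0² + 100·w1² the Hessian is constant.

H = np.array([[2.0, 0.0],
              [0.0, 200.0]])           # Hij = ∂²E/∂wi∂wj
lam_max = np.linalg.eigvalsh(H).max()  # 200
print(2 / lam_max)                     # 0.01: the maximum sensible step size

def run(eta, steps=200):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * (H @ w)          # ∇E(w) = H w for a quadratic
    return w

print(run(0.009))  # converges: w ≈ [0.03, 0] (slow along the flat direction)
print(run(0.011))  # blows up along the most curved direction
```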
Local Minima
I If you follow the gradient, where will you end up? Once you
  hit a local minimum, the gradient is 0, so you stop.
I Unfortunately, many differentiable error functions are not
  unimodal. When using gradient descent we can get stuck in
  local minima. Where we end up depends on where we start.

[Figure: a non-convex error surface over parameter space;
gradient descent paths from different starting points w(t) end
in different minima.]

I Some very nice error functions (e.g., linear least squares,
  logistic regression, lasso) are convex, and thus have a unique
  (global) minimum. Convexity means that the error at any
  convex combination of weights is no greater than the same
  convex combination of the original errors; it implies that any
  local minimum is global.
I But most settings do not lead to convex optimization
  problems.
I There is no great solution to this problem. It is a
  fundamental one. Usually, the best you can do is rerun the
  optimizer multiple times from different random starting
  points.

21 / 24
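A sketch of the rerun-from-random-starts heuristic (the non-convex E(w) = w⁴ − 3w² + w is invented for the demo; it has two local minima of different depths):

```python
import numpy as np

# Run gradient descent from several random starts; keep the best result.

E = lambda w: w**4 - 3*w**2 + w
grad_E = lambda w: 4*w**3 - 6*w + 1

rng = np.random.default_rng(0)
best_w, best_e = None, np.inf
for _ in range(10):
    w = rng.uniform(-2, 2)           # random starting point
    for _ in range(2000):
        w = w - 0.01 * grad_E(w)     # plain gradient descent
    if E(w) < best_e:                # keep the lowest minimum found
        best_w, best_e = w, E(w)
print(best_w, best_e)  # the deeper of the two minima (w ≈ −1.3)
```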
Advanced Topics That We Will Not Cover (Part I)

I Some of these issues (shallow valleys, curved error surfaces)
  can be fixed.
I Some of the fixes are second-order methods, like Newton’s
  method, that use the second derivatives.
I There are also fancy first-order methods like quasi-Newton
  methods (e.g., limited-memory BFGS) and conjugate
  gradient.
I These are the state-of-the-art methods for logistic regression
  (as long as there are not too many data points).
I We will not discuss these methods in the course.
I Other issues (like local minima) cannot be easily fixed.

22 / 24
Advanced Topics That We Will Not Cover (Part II)

I Sometimes the optimization problem has constraints.
I Example: Observe the points {0.5, 1.0} from a Gaussian
with known mean µ = 0.8 and unknown standard deviation
σ. Want to estimate σ by maximum likelihood.
I Constraint: σ must be positive.
I In this case, to find the maximum likelihood solution, the
  optimization problem is

      $$\max_{\sigma} \; \sum_{i=1}^{2} \left( -\log \sigma - \frac{(x_i - \mu)^2}{2\sigma^2} \right) \quad \text{subject to } \sigma > 0$$

I There are ways to solve this (in this case: can be done
analytically). We will not discuss them in this course.
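  For concreteness, here is the analytic solution (standard
  maximum likelihood algebra, not worked through on the slide):
  setting the derivative with respect to σ to zero gives

      $$\frac{\partial}{\partial \sigma} \sum_{i=1}^{2} \left( -\log \sigma - \frac{(x_i - \mu)^2}{2\sigma^2} \right) = -\frac{2}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{2} (x_i - \mu)^2 = 0 \;\Rightarrow\; \hat\sigma^2 = \frac{(0.5 - 0.8)^2 + (1.0 - 0.8)^2}{2} = 0.065,$$

  so σ̂ = √0.065 ≈ 0.255, which satisfies the constraint σ > 0.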

23 / 24
Summary

I Complex mathematical area. Do not implement your own
  optimization algorithms if you can help it!
I Stuff you should understand:
I How and why we convert learning problems into
optimization problems
I Modularity between modelling and optimization
I Gradient descent
I Why gradient descent can run into problems
I Especially local minima
I Methods of choice: Fancy first-order methods (e.g.,
quasi-Newton, CG) for moderate amounts of data.
Stochastic gradient for large amounts of data.

24 / 24
