Optimization For Machine Learning
Disclaimer
The information contained within this eBook is strictly for educational purposes. If you wish to apply
ideas contained in this eBook, you are taking full responsibility for your actions.
The author has made every effort to ensure the accuracy of the information within this book was
correct at time of publication. The author does not assume and hereby disclaims any liability to any
party for any loss, damage, or disruption caused by errors or omissions, whether such errors or
omissions result from accident, negligence, or any other cause.
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic or
mechanical, recording or by any information storage and retrieval system, without written permission
from the author.
Credits
Editor: Adrian Tam
Technical reviewers: Andrei Cheremskoy and Arun Koshy
Copyright
Optimization for Machine Learning
© 2021–2023 MachineLearningMastery.com. All Rights Reserved.
Edition: v1.05
Contents

Copyright  i
Preface  1
Introduction  3

I  Foundations  6

1  What is Function Optimization?  7
   Tutorial Overview  7
   Function Optimization  7
   Candidate Solutions  8
   Objective Functions  9
   Evaluation Costs  10
   Further Reading  11
   Summary  11

II  Background  25

4  No Free Lunch Theorem for Machine Learning  26
   Tutorial Overview  26
   What Is the No Free Lunch Theorem?  26
   Implications for Optimization  28
   Implications for Machine Learning  29
   Further Reading  30
   Summary  31

6  Premature Convergence  37
   Tutorial Overview  37
   Convergence in Machine Learning  38
   Premature Convergence  38
   Addressing Premature Convergence  39
   Further Reading  40
   Summary  41

   Further Reading  75
   Summary  75

VI  Projects  316

27  Use Optimization Algorithms to Manually Fit Regression Models  317
   Tutorial Overview  317
   Optimize Regression Models  317
   Optimize a Linear Regression Model  318
   Optimize a Logistic Regression Model  325
   Further Reading  331
   Summary  332
This book is intended to help machine learning practitioners understand the optimization
algorithms they regularly encounter.
All machine learning models involve optimization. Even in the simplest model, linear
regression, we assume the data follows a linear relationship and then work toward the values
of the coefficients that minimize the difference between the model's predictions and the observed
data. Such examples are everywhere. However, what distinguishes optimization in
mathematics from optimization in machine learning is that we almost always have to resort
to numerical and computational methods, rather than algebraic methods, to optimize. While
algebraic methods may give unique closed-form solutions, there are numerous computational
algorithms for optimization, and all give approximate solutions to varying degrees. No single
algorithm is a silver bullet for all problems.
Therefore, as machine learning practitioners, we need to know what the different
optimization algorithms are about, as well as their strengths and weaknesses, so we can pick
the most suitable one for a particular problem. Likewise, we need to know what we can
fine-tune in case the machine learning result needs improvement, or at least draw some
conclusions about the nature of the problem if the result is not as good as we expect. If all you
want is a result, you should probably try out some machine learning projects instead of reading
this book. But once you finish a few projects and wonder what the computer was doing during
"training" and why it took noticeable time to give you the trained model, this book will give
you some clues.
Just like any topic on the theoretical side of machine learning, optimization can be a very
deep and broad subject. It is important to know a bit of everything so that it makes sense to
you when you read the API documentation or other people's work. This book does not aim to
be a comprehensive guide. Indeed, most practitioners do not need to, for example, evaluate the
numerical stability of different optimization algorithms, nor know how to find the optima of a
nonlinear function mathematically. If you wish to, you can go deeper with more rigorous
academic titles. The goal of this book is to provide you with an overview and show you how the
different optimization algorithms can be applied.
Following the top-down approach of our other books, we let you see how we can optimize
a function of our choice with different algorithms. We use Python in the examples, hence
you may reuse the code by simply replacing the objective function with your own to help
solve your own problems. This book is put together with the hope that you will not see
gradient descent and the names of other algorithms as magic that only appears in machine
learning libraries. Rather, you can see them as generic methods for finding the optimum value of a
numerical function, and it just happens that training machine learning models is one
example of such a task.
Introduction
What to expect?
This book will teach you the basics of some optimization algorithms that you need to know as a
machine learning practitioner. After reading and working through the book, you will know:
▷ What function optimization is and why it is relevant and important to machine learning
▷ The trade-off in applying optimization algorithms, and the trade-off in tuning the
hyperparameters
▷ The difference between local optima and global optima
▷ How to visualize the progress and result of function optimization algorithms
▷ The stochastic nature of optimization algorithms
▷ Optimization by random search or grid search
▷ Carrying out local optimization by pattern search, quasi-Newton, least-squares, and hill
climbing methods
▷ Carrying out global optimization using evolution algorithms and simulated annealing
▷ The difference in various gradient descent algorithms, including momentum, AdaGrad,
RMSProp, Adadelta, and Adam; and how to use them
▷ How to apply optimization to common machine learning tasks
This book is not a substitute for an undergraduate course on optimization or numerical
methods. The textbooks for such courses will bring you deeper theoretical understanding, but
this book can complement them with practical examples. For some examples of textbooks and
other resources on optimization, see the Further Reading section at the end of each chapter.
▷ Part V: Gradient Descent. Introduce the common gradient descent algorithms that
we may encounter in, for example, neural network models. Examples are given on how
they are implemented.
▷ Part VI: Projects. Four examples are given to show how the function optimization
algorithms can be used to solve a real problem.
These are not designed to tell you everything, but to give you an understanding of how they
work and how to use them. This is to help you learn by doing so you can get results quickly.
"…unknowns. Our goal is to find values of the variables that optimize the objective."
-- Page 2, Numerical Optimization, 2006.
Function optimization involves three elements: the input to the function (e.g. x), the
objective function itself (e.g. f(x)), and the output from the function (e.g. cost, y).
▷ Input x: The input to the function to be evaluated, i.e. a candidate solution.
▷ Function f (): The objective function or target function that evaluates inputs.
▷ Cost y: The result of evaluating a candidate solution with the objective function,
minimized or maximized.
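For instance, these three elements can be made concrete in a few lines of Python; the quadratic objective used here is only an illustrative choice.

# a minimal sketch of the three elements of function optimization
def objective(x):      # the objective function f()
    return x ** 2.0

x = 2.5                # a candidate solution (input)
cost = objective(x)    # the cost (output) to be minimized
print('f(%.1f) = %.2f' % (x, cost))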
Let's take a closer look at each element in turn.
Importantly, candidate solutions are discrete and there are many of them. The universe of
candidate solutions may be vast, too large to enumerate. Instead, the best we can do is sample
candidate solutions in the search space. As a practitioner, we seek an optimization algorithm
that makes the best use of the information available about the problem to effectively sample
the search space and locate a good or best candidate solution.
▷ Search Space: Universe of candidate solutions defined by the number, type, and range
of accepted inputs to the objective function.
Finally, candidate solutions can be rank-ordered based on their evaluation by the objective
function, meaning that some are better than others.
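As a rough sketch of this idea, the snippet below draws a small random sample from a bounded search space and rank-orders the candidates by their cost; the bounds, sample size, and objective are assumptions made for illustration.

# sample the search space at random and rank candidates by their evaluation
from numpy.random import rand

def objective(x):
    return x ** 2.0

r_min, r_max = -5.0, 5.0                          # range of accepted inputs
candidates = r_min + rand(10) * (r_max - r_min)   # 10 random candidate solutions
ranked = sorted(candidates, key=objective)        # best (lowest cost) first
print('best candidate f(%.3f) = %.3f' % (ranked[0], objective(ranked[0])))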
of the inputs and computational cost of function evaluations makes mapping and plotting real
objective functions intractable.
"…optimal, but we do not generally know whether a local minimum is a global minimum. …a point at which the objective function is smaller than at all other feasible nearby points. They do not always find the global solution, which is the point with lowest function value among all feasible points. Global solutions are needed in some applications, but for many problems they are difficult to recognize and even more difficult to locate."
-- Page 6, Algorithms for Optimization, 2019.
On more challenging problems, we may be happy with a relatively good candidate solution
(i.e. good enough) given the time available for the project.
Books
Jorge Nocedal and Stephen Wright. Numerical Optimization. 2nd ed. Springer, 2006.
https://amzn.to/3sbjF2t
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge, 2004.
https://amzn.to/34mvCr1
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Articles
Mathematical optimization. Wikipedia.
https://en.wikipedia.org/wiki/Mathematical_optimization
1.7 Summary
In this tutorial, you discovered a gentle introduction to function optimization. Specifically, you
learned:
▷ The three elements of function optimization as candidate solutions, objective functions
and cost.
▷ The conceptualization of function optimization as navigating a search space and response
surface.
▷ The difference between global optima and local optima when solving a function
optimization problem.
Next, you will see how function optimization can help machine learning.
2  Optimization and Machine Learning
Machine learning involves using an algorithm to learn and generalize from historical data in
order to make predictions on new data. This problem can be described as approximating a
function that maps examples of inputs to examples of outputs. Approximating a function can
be solved by framing the problem as function optimization. This is where a machine learning
algorithm defines a parameterized mapping function (e.g. a weighted sum of inputs) and an
optimization algorithm is used to find the values of the parameters (e.g. model coefficients) that
minimize the error of the function when used to map inputs to outputs. This means that each
time we fit a machine learning algorithm on a training dataset, we are solving an optimization
problem. In this tutorial, you will discover the central role of optimization in machine learning.
After completing this tutorial, you will know:
▷ Machine learning algorithms perform function approximation, which is solved using
function optimization.
▷ Function optimization is the reason why we minimize error, cost, or loss when fitting a
machine learning algorithm.
▷ Optimization is also performed during data preparation, hyperparameter tuning, and
model selection in a predictive modeling project.
Let's get started.
The learned mapping will be imperfect. No model is perfect, and some prediction error is
expected given the difficulty of the problem, noise in the observed data, and the choice of
learning algorithm. Mathematically, learning algorithms solve the problem of approximating the
mapping function by solving a function optimization problem. Specifically, given examples of
inputs and outputs, find the set of inputs to the mapping function that results in the minimum
loss, minimum cost, or minimum prediction error. The more biased or constrained the choice of
mapping function, the easier the optimization is to solve.
Let's look at some examples to make this clear.
A linear regression (for regression problems) is a highly constrained model and can be
solved analytically using linear algebra. The inputs to the mapping function are the coefficients
of the model. We can use an optimization algorithm, like a quasi-Newton local search algorithm,
but it will almost always be less efficient than the analytical solution.
▷ Linear Regression: Function inputs are model coefficients, optimization problems that
can be solved analytically.
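As a hedged sketch of the analytical route (not an example from the book's project chapters), ordinary least squares can recover linear regression coefficients directly with linear algebra; the synthetic data and noise level below are made up for illustration.

# solve a linear regression analytically with ordinary least squares
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(scale=0.1, size=100)  # true slope 3.0, intercept 0.5

A = np.hstack([X, np.ones((len(X), 1))])   # append a column of ones for the intercept
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print('slope=%.3f intercept=%.3f' % (coef[0], coef[1]))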
A logistic regression (for classification problems) is slightly less constrained and must be solved
as an optimization problem, although something about the structure of the optimization function
being solved is known given the constraints imposed by the model. This means a local search
algorithm like a quasi-Newton method can be used. We could use a global search like stochastic
gradient descent, but it will almost always be less efficient.
▷ Logistic Regression: Function inputs are model coefficients, optimization problems that
require an iterative local search algorithm.
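A minimal sketch of that idea, assuming synthetic data and a hand-written log-likelihood rather than any code from the book: the logistic regression coefficients are found by a quasi-Newton local search (BFGS in SciPy).

# fit logistic regression coefficients with a quasi-Newton (BFGS) local search
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2.0 * X[:, 1] > 0).astype(float)   # labels from a known linear rule

def neg_log_likelihood(w):
    z = X @ w[:2] + w[2]                  # weighted sum of inputs plus intercept
    p = 1.0 / (1.0 + np.exp(-z))          # sigmoid
    eps = 1e-9                            # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

result = minimize(neg_log_likelihood, x0=np.zeros(3), method='BFGS')
print('coefficients:', result.x)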
A neural network model is a very flexible learning algorithm that imposes few constraints.
The inputs to the mapping function are the network weights. A local search algorithm cannot
be used given the search space is multimodal and highly nonlinear; instead, a global search
algorithm must be used.
A global optimization algorithm is commonly used, specifically stochastic gradient descent,
and the updates are made in a way that is aware of the structure of the model (backpropagation
and the chain rule). We could use a global search algorithm that is oblivious of the structure of
the model, like a genetic algorithm, but it will almost always be less efficient.
▷ Neural Network: Function inputs are model weights, optimization problems that require
an iterative global search algorithm.
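To make the idea of structure-aware updates concrete, here is a small sketch of stochastic gradient descent with backpropagation on a toy one-hidden-layer network; the architecture, learning rate, and data are illustrative assumptions, not the configuration of any real model.

# stochastic gradient descent with backpropagation on a tiny network
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3.0 * X[:, 0])                        # target function to approximate

n_hidden, lr = 10, 0.05
W1 = rng.normal(scale=0.5, size=(1, n_hidden))   # input-to-hidden weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1))   # hidden-to-output weights
b2 = np.zeros(1)

for epoch in range(200):
    for i in rng.permutation(len(X)):            # one example at a time (stochastic)
        x_i, y_i = X[i:i+1], y[i]
        h = np.tanh(x_i @ W1 + b1)               # forward pass
        pred = (h @ W2 + b2)[0, 0]
        err = pred - y_i                         # gradient of 0.5*(pred - y)^2 w.r.t. pred
        grad_W2 = h.T * err                      # backpropagate via the chain rule
        grad_b2 = np.array([err])
        grad_pre = (err * W2.T) * (1.0 - h ** 2) # through the tanh nonlinearity
        grad_W1 = x_i.T @ grad_pre
        grad_b1 = grad_pre[0]
        W2 -= lr * grad_W2; b2 -= lr * grad_b2   # SGD weight updates
        W1 -= lr * grad_W1; b1 -= lr * grad_b1

preds = np.tanh(X @ W1 + b1) @ W2 + b2
print('final MSE: %.4f' % np.mean((preds[:, 0] - y) ** 2))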
We can see that each algorithm makes different assumptions about the form of the mapping
function, which influences the type of optimization problem to be solved. We can also see that
the default optimization algorithm used for each machine learning algorithm is not arbitrary; it
represents the most efficient algorithm for solving the specific optimization problem framed by
the algorithm, e.g. stochastic gradient descent for neural nets instead of a genetic algorithm.
Deviating from these defaults requires a good reason.
Not all machine learning algorithms solve an optimization problem. A notable example is
the k-nearest neighbors algorithm that stores the training dataset and does a lookup for the k
best matches to each new example in order to make a prediction.
Now that we are familiar with learning in machine learning algorithms as optimization,
let's look at some related examples of optimization in a machine learning project.
Articles
Mathematical optimization. Wikipedia.
https://en.wikipedia.org/wiki/Mathematical_optimization
Function approximation. Wikipedia.
https://en.wikipedia.org/wiki/Function_approximation
Least-squares function approximation. Wikipedia.
https://en.wikipedia.org/wiki/Least-squares_function_approximation
Hyperparameter optimization. Wikipedia.
https://en.wikipedia.org/wiki/Hyperparameter_optimization
Model selection. Wikipedia.
https://en.wikipedia.org/wiki/Model_selection
2.6 Summary
In this tutorial, you discovered the central role of optimization in machine learning. Specifically,
you learned:
▷ Machine learning algorithms perform function approximation, which is solved using
function optimization.
▷ Function optimization is the reason why we minimize error, cost, or loss when fitting a
machine learning algorithm.
▷ Optimization is also performed during data preparation, hyperparameter tuning, and
model selection in a predictive modeling project.
Next, you will be introduced to the different optimization algorithms.
3  How to Choose an Optimization Algorithm
Optimization is the problem of finding a set of inputs to an objective function that results in a
maximum or minimum function evaluation. It is the challenging problem that underlies many
machine learning algorithms, from fitting logistic regression models to training artificial neural
networks. There are perhaps hundreds of popular optimization algorithms, and perhaps tens of
algorithms to choose from in popular scientific code libraries. This can make it challenging to
know which algorithms to consider for a given optimization problem.
In this tutorial, you will discover a guided tour of different optimization algorithms. After
completing this tutorial, you will know:
▷ Optimization algorithms may be grouped into those that use derivatives and those that
do not.
▷ Classical algorithms use the first and sometimes second derivative of the objective
function.
▷ Direct search and stochastic algorithms are designed for objective functions where
function derivatives are unavailable.
Let's get started.
values. The output from the function is also a real-valued evaluation of the input values. We
might refer to problems of this type as continuous function optimization, to distinguish from
functions that take discrete variables and are referred to as combinatorial optimization problems.
There are many different types of optimization algorithms that can be used for continuous
function optimization problems, and perhaps just as many ways to group and summarize them.
One approach to grouping optimization algorithms is based on the amount of information
available about the target function that is being optimized that, in turn, can be used and
harnessed by the optimization algorithm. Generally, the more information that is available
about the target function, the easier the function is to optimize if the information can effectively
be used in the search.
Perhaps the major division in optimization algorithms is whether the objective function can
be differentiated at a point or not. That is, whether the first derivative (gradient or slope) of
the function can be calculated for a given candidate solution or not. This partitions algorithms
into those that can make use of the calculated gradient information and those that do not.
▷ Differentiable Target Function?
◦ Algorithms that use derivative information.
◦ Algorithms that do not use derivative information.
We will use this as the major division for grouping optimization algorithms in this tutorial and
look at algorithms for differentiable and non-differentiable objective functions.
1 https://en.wikipedia.org/wiki/Differentiable_function
We can calculate the derivative of the derivative of the objective function, that is the rate of
change of the rate of change in the objective function. This is called the second derivative.
▷ Second-Order Derivative: Rate at which the derivative of the objective function changes.
For a function that takes multiple input variables, this is a matrix and is referred to as the
Hessian matrix.
▷ Hessian matrix: Second derivative of a function with two or more input variables.
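As a small illustration (a numerical sketch, not a definition from the book), the gradient and Hessian of a simple two-variable function can be estimated with central finite differences; the test point and step sizes are arbitrary.

# estimate the gradient and Hessian of f(x, y) = x^2 + y^2 numerically
import numpy as np

def objective(v):
    return v[0] ** 2.0 + v[1] ** 2.0

def gradient(f, v, eps=1e-5):
    g = np.zeros_like(v)
    for i in range(len(v)):
        step = np.zeros_like(v)
        step[i] = eps
        g[i] = (f(v + step) - f(v - step)) / (2.0 * eps)   # central difference
    return g

def hessian(f, v, eps=1e-4):
    H = np.zeros((len(v), len(v)))
    for i in range(len(v)):
        step = np.zeros_like(v)
        step[i] = eps
        H[:, i] = (gradient(f, v + step) - gradient(f, v - step)) / (2.0 * eps)
    return H

point = np.array([1.0, -2.0])
print(gradient(objective, point))   # approximately [2, -4]
print(hessian(objective, point))    # approximately [[2, 0], [0, 2]]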
Simple differentiable functions can be optimized analytically using calculus. Typically, the
objective functions that we are interested in cannot be solved analytically. Optimization is
significantly easier if the gradient of the objective function can be calculated, and as such, there
has been a lot more research into optimization algorithms that use the derivative than those
that do not. Some groups of algorithms that use gradient information include:
▷ Bracketing Algorithms
▷ Local Descent Algorithms
▷ First-Order Algorithms
▷ Second-Order Algorithms
Note: this taxonomy is inspired by the 2019 book Algorithms for Optimization
(https://amzn.to/39KZSQn).
type search in a line or hyperplane in the chosen direction. This process is repeated until no
further improvements can be made. The limitation is that it is computationally expensive to
optimize each directional move in the search space.
▷ Quasi-Newton Method
There are many Quasi-Newton Methods, and they are typically named for the developers of the
algorithm, such as:
▷ Davidon-Fletcher-Powell (DFP)
▷ Broyden-Fletcher-Goldfarb-Shanno (BFGS)
▷ Limited-memory BFGS (L-BFGS)
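A minimal sketch of a quasi-Newton search in practice, assuming SciPy is available and using a simple bowl-shaped function and an arbitrary starting point:

# quasi-Newton search with L-BFGS-B, supplying the gradient explicitly
import numpy as np
from scipy.optimize import minimize

def objective(v):
    return v[0] ** 2.0 + v[1] ** 2.0

def derivative(v):
    return np.array([2.0 * v[0], 2.0 * v[1]])

result = minimize(objective, x0=np.array([3.0, -4.0]), jac=derivative, method='L-BFGS-B')
print('f(%s) = %.6f after %d evaluations' % (result.x, result.fun, result.nfev))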
Now that we are familiar with the so-called classical optimization algorithms, let's look at
algorithms used when the objective function is not differentiable.
These direct estimates are then used to choose a direction to move in the search space and
triangulate the region of the optima.
Examples of direct search algorithms include:
▷ Cyclic Coordinate Search
▷ Powell's Method
▷ Hooke-Jeeves Method
▷ Nelder-Mead Simplex Search
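For example, a direct search can be run with SciPy's Nelder-Mead implementation using only function evaluations and no derivative information; the function and starting point below are illustrative choices.

# derivative-free direct search with the Nelder-Mead simplex method
import numpy as np
from scipy.optimize import minimize

def objective(v):
    return v[0] ** 2.0 + v[1] ** 2.0

result = minimize(objective, x0=np.array([3.0, -4.0]), method='nelder-mead')
print('f(%s) = %.6f after %d evaluations' % (result.x, result.fun, result.nfev))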
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Articles
Mathematical optimization. Wikipedia.
https://en.wikipedia.org/wiki/Mathematical_optimization
3.6 Summary
In this tutorial, you discovered a guided tour of different optimization algorithms.
Specifically, you learned:
▷ Optimization algorithms may be grouped into those that use derivatives and those that
do not.
▷ Classical algorithms use the first and sometimes second derivative of the objective
function.
▷ Direct search and stochastic algorithms are designed for objective functions where
function derivatives are unavailable.
Next, you will learn about some optimization concepts, starting with the no free lunch theorem.
II  Background
4  No Free Lunch Theorem for Machine Learning
The No Free Lunch Theorem is often thrown around in the field of optimization and machine
learning, often with little understanding of what it means or implies. The theorem states that
all optimization algorithms perform equally well when their performance is averaged across all
possible problems. It implies that there is no single best optimization algorithm. Because of
the close relationship between optimization, search, and machine learning, it also implies that
there is no single best machine learning algorithm for predictive modeling problems such as
classification and regression.
In this tutorial, you will discover the no free lunch theorem for optimization and search.
After completing this tutorial, you will know:
▷ The no free lunch theorem suggests that the performance of all optimization algorithms is
identical, under some specific constraints.
▷ There is provably no single best optimization algorithm or machine learning algorithm.
▷ The practical implications of the theorem may be limited given we are interested in a
small subset of all possible objective functions.
Let's get started.
"The NFL stated that within certain constraints, over the space of all possible problems, every optimization technique will perform as well as every other one on average (including Random Search)."
-- Page 203, Essentials of Metaheuristics, 2011.
The theorem applies to optimization generally and to search problems, as optimization can
be described or framed as a search problem. The implication is that the performance of your
favorite algorithm is identical to a completely naive algorithm, such as random search.
"Roughly speaking we show that for both static and time dependent optimization problems the average performance of any pair of algorithms across all possible problems is exactly identical."
-- "No Free Lunch Theorems For Optimization", 1997.
An easy way to think about this finding is to consider a large table like you might have in
Excel. Across the top of the table, each column represents a different optimization algorithm.
Down the side of the table, each row represents a different objective function. Each cell of the
table is the performance of the algorithm on the objective function, using whatever performance
measure you like, as long as it is consistent across the whole table.
Figure 4.1: Depiction of the No Free Lunch Theorem as a Table of Algorithms and Problems
You can imagine that this table will be infinitely large. Nevertheless, in this table, we can
calculate the average performance of any algorithm from all the values in its column and it will
be identical to the average performance of any other algorithm column.
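As a toy illustration of this table intuition (a sketch under strong simplifying assumptions: a tiny discrete domain and fixed, non-adaptive evaluation orders rather than real optimization algorithms), the snippet below enumerates every possible objective function over four inputs and shows that two different search orders have the same average performance.

# enumerate all objective functions over a tiny domain and average performance
from itertools import product

domain = [0, 1, 2, 3]                       # four candidate inputs
orders = {'algorithm A': [0, 1, 2, 3],      # visit inputs left to right
          'algorithm B': [3, 1, 0, 2]}      # visit inputs in a scrambled order
budget = 2                                  # evaluations allowed per problem

problems = list(product([0, 1], repeat=len(domain)))   # all 16 objective functions
for name, order in orders.items():
    total = 0.0
    for values in problems:                 # values[i] is f(domain[i])
        total += min(values[i] for i in order[:budget])  # best value found
    print('%s: average best value = %.3f' % (name, total / len(problems)))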
If one algorithm performs better than another algorithm on one class of problems, it must
perform worse on some other class of problems, so that averaged over all possible problems the
performance of the two is the same.

"The limit is pretty low: no learner can be better than random guessing!"
-- Page 63, The Master Algorithm, 2018.
The catch is that the application of algorithms assumes nothing about the problem.
In fact, algorithms are applied to objective functions with no prior knowledge, not even
whether the objective function is to be minimized or maximized. And this is a hard constraint of
the theorem.
We often have "some" knowledge about the objective function being optimized. In fact,
if in practice we truly knew nothing about the objective function, we could not choose an
optimization algorithm.
"As elaborated by the no free lunch theorems of Wolpert and Macready, there is no reason to prefer one algorithm over another unless we make assumptions about the probability distribution over the space of possible objective functions."
-- Page 6, Algorithms for Optimization, 2019.
The beginner practitioner in the field of optimization is counseled to learn and use as much
about the problem as possible in the optimization algorithm. The more we know and harness
in the algorithms about the problem, the better tailored the technique is to the problem and
the more likely the algorithm is expected to perform well on the problem. The no free lunch
theorem supports this advice.
"We don't care about all possible worlds, only the one we live in. If we know something about the world and incorporate it into our learner, it now has an advantage over random guessing."
-- Page 63, The Master Algorithm, 2018.

Additionally, the performance is averaged over all possible objective functions and all
possible optimization algorithms. Whereas in practice, we are interested in a small subset of
objective functions that may have a specific structure or form and algorithms tailored to those
functions.
". . . we cannot emphasize enough that no claims whatsoever are being made in this paper concerning how well various search algorithms work in practice. The focus of this paper is on what can be said a priori without any assumptions and from mathematical principles alone concerning the utility of a search algorithm."
-- "No Free Lunch Theorems For Optimization", 1997.

These implications lead some practitioners to note the limited practical value of the theorem.
"This is of considerable theoretical interest but, I think, of limited practical value, because the space of all possible problems likely includes many extremely unusual and pathological problems which are rarely if ever seen in practice."
-- Page 203, Essentials of Metaheuristics, 2011.

Now that we have reviewed the implications of the no free lunch theorem for optimization,
let's review the implications for machine learning.
"…with for supervised learning avoid overfitting prefer simpler to more complex models etc. [no free lunch] says that all such heuristics fail as often as they succeed."
-- "The Supervised Learning No-Free-Lunch Theorems", 2002.

Given that there is no best single machine learning algorithm across all possible prediction
problems, it motivates the need to continue to develop new learning algorithms and to better
understand algorithms that have already been developed.
"As a consequence of the no free lunch theorem, we need to develop many different types of models, to cover the wide variety of data that occurs in the real world. And for each model, there may be many different algorithms we can use to train the model, which make different speed-accuracy-complexity tradeoffs."
-- Pages 24–25, Machine Learning: A Probabilistic Perspective, 2012.
It also supports the argument for testing a suite of different machine learning algorithms on
a given predictive modeling problem.

"The "No Free Lunch" Theorem argues that, without having substantive information about the modeling problem, there is no single model that will always do better than any other model. Because of this, a strong case can be made to try a wide variety of techniques, then determine which model to focus on."
-- Pages 25–26, Applied Predictive Modeling, 2013.
Nevertheless, as with optimization, the implications of the theorem are based on the choice
of learning algorithms having zero knowledge of the problem that is being solved. In practice,
this is not the case, and a beginner machine learning practitioner is encouraged to review the
available data in order to learn something about the problem that can be incorporated into the
learning algorithm. We may even want to take this one step further and say that learning is not
possible without some prior knowledge and that data alone is not enough.
"In the meantime, the practical consequence of the "no free lunch" theorem is that there's no such thing as learning without knowledge. Data alone is not enough."
-- Page 64, The Master Algorithm, 2018.
4.5 Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
D. H. Wolpert and W. G. Macready. "No Free Lunch Theorems For Optimization". IEEE
Transactions on Evolutionary Computation, 1(1), 1997, pp. 67–82.
https://ieeexplore.ieee.org/abstract/document/585893
David H. Wolpert. "The Supervised Learning No-Free-Lunch Theorems". In: Soft Computing
and Industry. Ed. by R. Roy et al. Springer, 2002.
https://link.springer.com/chapter/10.1007/978-1-4471-0123-9_3
David H. Wolpert. "The Lack of A Priori Distinctions Between Learning Algorithms". Neural
Computation, 8(7), 1996, pp. 1341–1390.
https://www.mitpressjournals.org/doi/abs/10.1162/neco.1996.8.7.1341
David H. Wolpert and William G. Macready. No Free Lunch Theorems for Search. Tech. rep.
SFI-TR-95-02-010. The Santa Fe Institute, 1995.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.7505&rep=rep1&type=pdf
Books
Sean Luke. Essentials of Metaheuristics. lulu.com, 2011.
https://amzn.to/3lHryZr
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Pedro Domingos. The Master Algorithm. Basic Books, 2018.
https://amzn.to/3lKKGFX
Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
https://amzn.to/3nJJe8s
Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer, 2013.
https://amzn.to/3ly7nwK
Articles
No free lunch in search and optimization. Wikipedia.
https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization
No free lunch theorem. Wikipedia.
https://en.wikipedia.org/wiki/No_free_lunch_theorem
No Free Lunch Theorems.
http://www.no-free-lunch.org/
4.6 Summary
In this tutorial, you discovered the no free lunch theorem for optimization and search. Specifically,
you learned:
▷ The no free lunch theorem suggests that the performance of all optimization algorithms is
identical, under some specific constraints.
▷ There is provably no single best optimization algorithm or machine learning algorithm.
▷ The practical implications of the theorem may be limited given we are interested in a
small subset of all possible objective functions.
Next, you will learn the distinction between local and global optimization.
5  Local Optimization vs. Global Optimization
Optimization refers to finding the set of inputs to an objective function that results in the
maximum or minimum output from the objective function. It is common to describe optimization
problems in terms of local vs. global optimization. Similarly, it is also common to describe
optimization algorithms or search algorithms in terms of local vs. global search.
In this tutorial, you will discover the practical differences between local and global
optimization. After completing this tutorial, you will know:
▷ Local optimization involves finding the optimal solution for a specific region of the
search space, or the global optima for problems with no local optima.
▷ Global optimization involves finding the optimal solution on problems that contain local
optima.
▷ How and when to use local and global search algorithms and how to use both methods
in concert.
Let's get started.
An objective function may have many local optima, or it may have a single local optima, in
which case the local optima is also the global optima.
▷ Local Optimization: Locate the optima for an objective function from a starting point
believed to contain the optima (e.g. a basin).
Local optimization1 or local search refers to searching for the local optima. A local
optimization algorithm, also called a local search algorithm, is an algorithm intended to locate
a local optima. It is suited to traversing a given region of the search space and getting close to
(or finding exactly) the extrema of the function in that region.
". . . local optimization methods are widely used in applications where there is value . . ."
An objective function always has a global optima (otherwise we would not be interested in
optimizing it), although it may also have local optima that have an objective function evaluation
that is not as good as the global optima. The global optima may be the same as the local
optima, in which case it would be more appropriate to refer to the optimization problem as a
local optimization, instead of global optimization.
The presence of the local optima is a major component of what defines the difficulty of a
global optimization problem as it may be relatively easy to locate a local optima and relatively
difficult to locate the global optima.
Global optimization3 or global search refers to searching for the global optima. A global
optimization algorithm, also called a global search algorithm, is intended to locate a global
optima. It is suited to traversing the entire input search space and getting close to (or finding
exactly) the extrema of the function.
"Global optimization is used for problems with a small number of variables, where computing time is not critical, and the value of finding the true global solution is very high."
-- Page 9, Convex Optimization, 2004.

Global search algorithms may involve managing a single or a population of candidate
solutions from which new candidate solutions are iteratively generated and evaluated to see if
they result in an improvement and taken as the new working state. There may be debate over
what exactly constitutes a global search algorithm; nevertheless, three examples of global search
algorithms using our definitions include:
▷ Genetic Algorithm
▷ Simulated Annealing
▷ Particle Swarm Optimization
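As one hedged example of a global search in code (using SciPy's dual_annealing, a generalization of simulated annealing, on a made-up multimodal function):

# global search with simulated annealing on a function with many local optima
import numpy as np
from scipy.optimize import dual_annealing

def objective(v):
    x = v[0]
    return x ** 2.0 + 10.0 * np.sin(5.0 * x)   # many local optima, one global optimum

result = dual_annealing(objective, bounds=[(-5.0, 5.0)], seed=1)
print('f(%.5f) = %.5f' % (result.x[0], result.fun))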
Now that we are familiar with global and local optimization, let's compare and contrast the two.
3 https://en.wikipedia.org/wiki/Global_optimization
▷ Local search: When you are in the region of the global optima.
▷ Global search: When you know that there are local optima.
Local search algorithms often give computational complexity guarantees related to locating the
global optima, as long as the assumptions made by the algorithm hold. Global search algorithms
often give very few if any guarantees about locating the global optima. As such, global search is
often used on problems that are sufficiently difficult that "good" or "good enough" solutions are
preferred over no solutions at all. This might mean relatively good local optima instead of the
true global optima if locating the global optima is intractable.
It is often appropriate to re-run or re-start the algorithm multiple times and record the
optima found by each run to give some confidence that relatively good solutions have been
located.
▷ Local search: For narrow problems where the global solution is required.
▷ Global search: For broad problems where the global optima might be intractable.
We often know very little about the response surface for an objective function, e.g. whether a
local or global search algorithm is most appropriate. Therefore, it may be desirable to establish a
baseline in performance with a local search algorithm and then explore a global search algorithm
to see if it can perform better. If it cannot, it may suggest that the problem is indeed unimodal
or appropriate for a local search algorithm.
▷ Best Practice: Establish a baseline with a local search then explore a global search on
objective functions where little is known.
Local optimization is a simpler problem to solve than global optimization. As such, the
vast majority of the research on mathematical optimization has been focused on local search
techniques.
"A large fraction of the research on general nonlinear programming has focused on … minima and finding the best regions of the design space. Unfortunately, these methods do not perform as well in local search in comparison to descent methods."
-- Page 162, Algorithms for Optimization, 2019.

As such, they may locate the basin for a good local optima or the global optima, but may not
be able to locate the best solution within the basin.
"Local and global optimization techniques can be combined to form hybrid training algorithms."
-- Page 37, Computational Intelligence: An Introduction, 2007.

Therefore, it is a good practice to apply a local search to the optima candidate solutions found
by a global search algorithm.
▷ Best Practice: Apply a local search to the solutions found by a global search.
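A small sketch of this best practice, assuming SciPy and the same made-up multimodal function as above: an evolutionary global search proposes a candidate, and a quasi-Newton local search then refines it.

# hybrid strategy: global search first, then local refinement of its best solution
import numpy as np
from scipy.optimize import differential_evolution, minimize

def objective(v):
    x = v[0]
    return x ** 2.0 + 10.0 * np.sin(5.0 * x)

# global stage: evolutionary search across the bounded space (no local polish)
coarse = differential_evolution(objective, bounds=[(-5.0, 5.0)], seed=1, polish=False)
# local stage: quasi-Newton refinement starting from the global result
refined = minimize(objective, x0=coarse.x, method='L-BFGS-B', bounds=[(-5.0, 5.0)])
print('global stage: f(%.5f) = %.5f' % (coarse.x[0], coarse.fun))
print('after refinement: f(%.5f) = %.5f' % (refined.x[0], refined.fun))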
Books
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge, 2004.
https://amzn.to/34mvCr1
Andries P. Engelbrecht. Computational Intelligence: An Introduction. 2nd ed. Wiley, 2007.
https://amzn.to/3ob61KA
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Articles
Local search (optimization). Wikipedia.
https://en.wikipedia.org/wiki/Local_search_(optimization)
Global optimization. Wikipedia.
https://en.wikipedia.org/wiki/Global_optimization
5.6 Summary
In this tutorial, you discovered the practical differences between local and global optimization.
Specifically, you learned:
▷ Local optimization involves finding the optimal solution for a specific region of the
search space, or the global optima for problems with no local optima.
▷ Global optimization involves finding the optimal solution on problems that contain local
optima.
▷ How and when to use local and global search algorithms and how to use both methods
in concert.
Next, you will learn about convergence and what happens if an algorithm converges
prematurely.
6  Premature Convergence
Convergence refers to the limit of a process and can be a useful analytical tool when evaluating
the expected performance of an optimization algorithm. It can also be a useful empirical tool
when exploring the learning dynamics of an optimization algorithm, and machine learning
algorithms trained using an optimization algorithm, such as deep learning neural networks.
This motivates the investigation of learning curves and techniques, such as early stopping. If
optimization is a process that generates candidate solutions, then convergence represents a
stable point at the end of the process when no further changes or improvements are expected.
Premature convergence refers to a failure mode for an optimization algorithm where the process
stops at a stable point that does not represent a globally optimal solution.
In this tutorial, you will discover a gentle introduction to premature convergence in machine
learning. After completing this tutorial, you will know:
▷ Convergence refers to the stable point found at the end of a sequence of solutions via
an iterative optimization algorithm.
▷ Premature convergence refers to a stable point found too soon, perhaps close to the
starting point of the search, and with a worse evaluation than expected.
▷ Greediness of an optimization algorithm provides a control over the rate of convergence
of an algorithm.
Let's get started.
"…in that direction and repeating that process until convergence or some termination condition is met."
-- Page 13, Algorithms for Optimization, 2019.

▷ Convergence: Stop condition for an optimization algorithm where a stable point
is located and further iterations of the algorithm are unlikely to result in further
improvement.
We might measure and explore the convergence of an optimization algorithm empirically, such
as using learning curves. Additionally, we might also explore the convergence of an optimization
algorithm analytically, such as a convergence proof or average case computational complexity.
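As a small empirical sketch (the hill climber, step size, and iteration budget are arbitrary choices), recording the best objective value after each iteration produces a learning curve whose flattening suggests convergence.

# plot the best-so-far objective value per iteration of a simple hill climber
import numpy as np
from matplotlib import pyplot

def objective(x):
    return x ** 2.0

np.random.seed(1)
solution, step_size, history = 4.0, 0.2, []
best = objective(solution)
for i in range(100):
    candidate = solution + np.random.randn() * step_size   # small random move
    if objective(candidate) <= best:
        solution, best = candidate, objective(candidate)    # keep improvements
    history.append(best)                                    # best value so far

pyplot.plot(history)       # a flattening curve suggests the search has converged
pyplot.xlabel('iteration')
pyplot.ylabel('best objective value')
pyplot.show()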
"Strong selection pressure results in rapid, but possibly premature, convergence. … landscapes that change over time, too much exploitation generally results in premature convergence to suboptimal peaks in the space."
-- Page 60, Evolutionary Computation: A Unified Approach, 2002.
In this way, premature convergence is described as finding a locally optimal solution instead
of the globally optimal solution for an optimization algorithm. It is a specific failure case for an
optimization algorithm.
▷ Premature Convergence: Convergence of an optimization algorithm to a worse than
optimal stable point that is likely close to the starting point.
Put another way, convergence signifies the end of the search process, e.g. a stable point was
located and further iterations of the algorithm are not likely to improve upon the solution.
Premature convergence refers to reaching this stop condition of an optimization algorithm at a
less than desirable stationary point.
"…rapidly than operators with a low selective pressure, which may lead to premature convergence to suboptimal solutions. A high selective pressure limits the exploration abilities of the population."
-- Page 135, Computational Intelligence: An Introduction, 2007.
This idea of selective pressure is helpful more generally in understanding the learning
dynamics of optimization algorithms. For example, an optimization algorithm that is configured
to be too greedy (e.g. via hyperparameters such as the step size or learning rate) may fail due to
premature convergence, whereas the same algorithm configured to be less greedy may overcome
premature convergence and discover a better or globally optimal solution.
Premature convergence may be encountered when using stochastic gradient descent to train
a neural network model, signified by a learning curve that drops exponentially quickly then
stops improving.
"The number of updates required to reach convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set."
-- Page 153, Deep Learning, 2016.

The fact that fitting neural networks is subject to premature convergence motivates the
use of methods such as learning curves to monitor and diagnose issues with the convergence
of a model on a training dataset, and the use of regularization, such as early stopping, which
halts the optimization algorithm before it reaches a stable point that comes at the expense of
worse performance on a holdout dataset. As such, much research into deep learning neural
networks is ultimately directed at overcoming premature convergence.
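The early stopping idea can be sketched in a few lines; the validation losses and patience value below are made up purely to keep the example self-contained.

# stop when the validation loss has not improved for `patience` consecutive checks
patience, best, waited = 3, float('inf'), 0
validation_losses = [0.90, 0.71, 0.65, 0.64, 0.66, 0.65, 0.67, 0.70]

for epoch, loss in enumerate(validation_losses):
    if loss < best:
        best, waited = loss, 0    # improvement: remember it and reset the counter
    else:
        waited += 1               # no improvement this epoch
    if waited >= patience:
        print('stopping early at epoch %d, best validation loss %.2f' % (epoch, best))
        break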
"Empirically, it is often found that 'tanh' activation functions give rise to faster … initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether."
-- Page 301, Deep Learning, 2016.
This also includes the vast number of variations and extensions of the stochastic gradient
descent optimization algorithm, such as the addition of momentum so that the algorithm does
not overshoot the optima (stable point), and Adam that adds an automatically adapted step
size hyperparameter (learning rate) for each parameter that is being optimized, dramatically
speeding up convergence.
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Kenneth A. De Jong. Evolutionary Computation: A Unified Approach. Bradford Book, 2002.
https://amzn.to/2LjWceK
Articles
Limit of a sequence. Wikipedia.
https://en.wikipedia.org/wiki/Limit_of_a_sequence
Convergence of random variables. Wikipedia.
https://en.wikipedia.org/wiki/Convergence_of_random_variables
Premature convergence. Wikipedia.
https://en.wikipedia.org/wiki/Premature_convergence
6.6 Summary
In this tutorial, you discovered a gentle introduction to premature convergence in machine
learning. Specifically, you learned:
▷ Convergence refers to the stable point found at the end of a sequence of solutions via
an iterative optimization algorithm.
▷ Premature convergence refers to a stable point found too soon, perhaps close to the
starting point of the search, and with a worse evaluation than expected.
▷ Greediness of an optimization algorithm provides a control over the rate of convergence
of an algorithm.
Next, you will learn how to plot a function so you can see the progress of optimization
visually.
7  Creating Visualizations for Function Optimization
Function optimization involves finding the input that results in the optimal value from an
objective function. Optimization algorithms navigate the search space of input variables in
order to locate the optima, and both the shape of the objective function and behavior of
the algorithm in the search space are opaque on real-world problems. As such, it is common
to study optimization algorithms using simple low-dimensional functions that can be easily
visualized directly. Additionally, the samples in the input space of these simple functions made
by an optimization algorithm can be visualized with their appropriate context. Visualization
of lower-dimensional functions and algorithm behavior on those functions can help to develop
the intuitions that can carry over to more complex higher-dimensional function optimization
problems later.
In this tutorial, you will discover how to create visualizations for function optimization in
Python. After completing this tutorial, you will know:
▷ Visualization is an important tool when studying function optimization algorithms.
▷ How to visualize one-dimensional functions and samples using line plots.
▷ How to visualize two-dimensional functions and samples using contour and surface plots.
Let's get started.
f(x) = x²
This has an optimal value at the input x = 0.0, where the function evaluates to 0.0. The example
below implements this objective function and evaluates a single input.
def objective(x):
    return x**2.0

print('f(%.3f) = %.3f' % (4.0, objective(4.0)))
Running the example evaluates the value 4.0 with the objective function, which equals 16.0.
f(4.000) = 16.000
Output 7.1: Result of Program 7.1
...
# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# summarize some of the input domain
print(inputs[:5])
Program 7.2: Set up uniform input range
...
# compute targets
results = objective(inputs)
# summarize some of the results
print(results[:5])
Program 7.3: Compute target from each input
Finally, we can check some of the input and their corresponding outputs.
...
for i in range(5):
print('f(%.3f) = %.3f' % (inputs[i], results[i]))
Program 7.4: Creating a mapping of some inputs to some results
Tying this together, the complete example of sampling the input space and evaluating all points
in the sample is listed below.
# objective function
def objective(x):
return x**2.0
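The listing above is cut short by the page break; assembled from Programs 7.2 to 7.4, the complete example looks roughly like the following (the exact layout of the original listing may differ).

# sample the input space uniformly and evaluate every point
from numpy import arange

# objective function
def objective(x):
    return x**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# summarize some of the input domain
print(inputs[:5])
# compute targets
results = objective(inputs)
# summarize some of the results
print(results[:5])
# map some inputs to some results
for i in range(5):
    print('f(%.3f) = %.3f' % (inputs[i], results[i]))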
Running the example first generates a uniform sample of input points as we expected. The
input points are then evaluated using the objective function and finally, we can see a simple
mapping of inputs to outputs of the objective function.
Now that we have some confidence in generating a sample of inputs and evaluating them with
the objective function, we can look at generating plots of the function.
space are ordered from smallest to largest. This ordering is important as we expect (hope) that
the output of the objective function has a similar smooth relationship between values, i.e. small
changes in input result in locally consistent (smooth) changes in the output of the function.
In this case, we can use the samples to generate a line plot of the objective function with
the input points (x) on the x-axis of the plot and the objective function output (results) on the
y-axis of the plot.
...
# create a line plot of input vs result
pyplot.plot(inputs, results)
# show the plot
pyplot.show()
Program 7.6: Creating line plot of the function
# objective function
def objective(x):
return x**2.0
Running the example creates a line plot of the objective function. We can see that the function
has a large U-shape, called a parabola1. This is a common shape when studying curves, i.e. the
study of calculus2.
1 https://en.wikipedia.org/wiki/Parabola
2 https://en.wikipedia.org/wiki/Calculus
# objective function
def objective(x):
return x**2.0
Running the example creates a scatter plot of the objective function. We can see the familiar
shape of the function, but we don't gain anything from plotting the points directly. The line
and the smooth interpolation between the points it provides are more useful as we can draw
other points on top of the line, such as the location of the optima or the points sampled by an
optimization algorithm.
...
optima_x = 0.0
optima_y = objective(optima_x)
Program 7.9: Define the known function optima
We can then plot this point with any shape or color we like, in this case, a red square.
...
pyplot.plot([optima_x], [optima_y], 's', color='r')
Program 7.10: Draw the function optima as a red square
Tying this together, the complete example of creating a line plot of the function with the optima
highlighted by a point is listed below.
# objective function
def objective(x):
return x**2.0
Running the example creates the familiar line plot of the function, and this time, the optima of
the function, i.e. the input that results in the minimum output of the function, is marked with
a red square.
Figure 7.3: Line plot of a one-dimensional function with optima marked by a red square
This is a very simple function and the red square for the optima is easy to see. Sometimes
the function might be more complex, with lots of hills and valleys, and we might want to make
the optima more visible. In this case, we can draw a vertical line across the whole plot.
...
pyplot.axvline(x=optima_x, ls='--', color='red')
Program 7.12: Draw a vertical line at the optimal input
# objective function
def objective(x):
return x**2.0
Running the example creates the same plot and this time draws a red line clearly marking the
point in the input space that marks the optima.
Figure 7.4: Line plot of a one-dimensional function with optima marked by a red line
...
seed(1)
sample = r_min + rand(10) * (r_max - r_min)
# evaluate the sample
sample_eval = objective(sample)
Program 7.14: Simulate a sample made by an optimization algorithm
We can then plot this sample, in this case using small black circles.
...
pyplot.plot(sample, sample_eval, 'o', color='black')
Program 7.15: Plot the sample as black circles
The complete example of creating a line plot of a function with the optima marked by a red
line and an algorithm sample drawn with small black dots is listed below.
# objective function
def objective(x):
return x**2.0
Running the example creates the line plot of the domain and marks the optima with a red line
as before. This time, the sample from the domain selected by an algorithm (really a random
sample of points) is drawn with black dots. We can imagine that a real optimization algorithm
will show points narrowing in on the domain as it searches down-hill from a starting point.
Figure 7.5: Line Plot of a One-Dimensional Function With Optima Marked by a Red
Line and Samples Shown with Black Dots
Next, let's look at how we might perform similar visualizations for the optimization of a
two-dimensional function.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# evaluate a single input
x = 4.0
y = 4.0
result = objective(x, y)
print('f(%.3f, %.3f) = %.3f' % (x, y, result))
Program 7.17: Example of a 2D objective function
Running the example evaluates the point [x = 4, y = 4], which equals 32.
Next, we need a way to sample the domain so that we can, in turn, sample the objective function.
...
# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# summarize some of the input domain
print(x[:5, :5])
Program 7.18: Using meshgrid() to create sample points
We can then evaluate each pair of points using our objective function.
...
# compute targets
results = objective(x, y)
# summarize some of the results
print(results[:5, :5])
Program 7.19: Evaluate targets from each sample point
Finally, we can review the mapping of some of the inputs to their corresponding output values.
3 https://www.mathworks.com/help/matlab/ref/meshgrid.html
4 https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html
...
for i in range(5):
print('f(%.3f, %.3f) = %.3f' % (x[i,0], y[i,0], results[i,0]))
Program 7.20: Create a mapping of some inputs to some results
The example below demonstrates how we can create a uniform sample grid across the two-
dimensional input space and objective function.
# objective function
def objective(x, y):
return x**2.0 + y**2.0
Running the example first summarizes some points in the mesh grid, then the objective function
evaluation for some points. Finally, we enumerate coordinates in the two-dimensional input
space and their corresponding function evaluation.
Now that we are familiar with how to sample the input space and evaluate points, let's look at
how we might plot the function.
...
# create a contour plot with 50 levels and jet color scheme
pyplot.contour(x, y, results, 50, alpha=1.0, cmap='jet')
# show the plot
pyplot.show()
Program 7.22: Creating a contour plot
Tying this together, the complete example of creating a contour plot of the two-dimensional
objective function is listed below.
# objective function
def objective(x, y):
return x**2.0 + y**2.0
Running the example creates the contour plot. We can see that the more curved parts of the
surface around the edges have more contours to show the detail, and the less curved parts of
the surface in the middle have fewer contours. We can see that the lowest part of the domain is
the middle, as expected.
...
pyplot.contourf(x, y, results, levels=50, cmap='jet')
Program 7.24: Create a filled contour plot with 50 levels and jet color scheme
We can also show the optima on the plot, in this case as a white star that will stand out against
the blue background color of the lowest part of the plot.
...
# define the known function optima
optima_x = [0.0, 0.0]
# draw the function optima as a white star
pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')
Program 7.25: Draw the function optima as a white star
Tying this together, the complete example of a filled contour plot of the two-dimensional objective function with the optima marked is listed below.
# objective function
def objective(x, y):
return x**2.0 + y**2.0
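A minimal sketch of the complete filled contour program with the optima marked, assembled from the snippets above, might look like this:

from numpy import arange, meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# sample the input domain on a uniform grid
r_min, r_max = -5.0, 5.0
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
x, y = meshgrid(xaxis, yaxis)
results = objective(x, y)
# create a filled contour plot with 50 levels and the jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# define the known function optima and draw it as a white star
optima_x = [0.0, 0.0]
pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')
pyplot.show()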
Running the example creates the filled contour plot that gives a better idea of the shape of the objective function. The optima at [x = 0, y = 0] is then marked clearly with a white star.
Figure 7.7: Filled Contour Plot of a Two-Dimensional Objective Function With Optima
Marked by a White Star
...
seed(1)
sample_x = r_min + rand(10) * (r_max - r_min)
sample_y = r_min + rand(10) * (r_max - r_min)
Program 7.27: Simulate a sample made by an optimization algorithm
These points can then be plotted directly as black circles and their context color can give an
idea of their relative quality.
...
# plot the sample as black circles
pyplot.plot(sample_x, sample_y, 'o', color='black')
Program 7.28: Plot the sample as black circles
Tying this together, the complete example of a filled contour plot with the optima and an input sample plotted is listed below.
# objective function
def objective(x, y):
return x**2.0 + y**2.0
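A rough sketch of the complete program, combining the filled contour, the optima marker, and the simulated sample (the sample size and seed are assumptions), might be:

from numpy import arange, meshgrid
from numpy.random import seed, rand
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# sample the input domain on a uniform grid
r_min, r_max = -5.0, 5.0
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
x, y = meshgrid(xaxis, yaxis)
results = objective(x, y)
# create a filled contour plot and mark the known optima with a white star
pyplot.contourf(x, y, results, levels=50, cmap='jet')
pyplot.plot([0.0], [0.0], '*', color='white')
# simulate a sample made by an optimization algorithm
seed(1)
sample_x = r_min + rand(10) * (r_max - r_min)
sample_y = r_min + rand(10) * (r_max - r_min)
# plot the sample as black circles
pyplot.plot(sample_x, sample_y, 'o', color='black')
pyplot.show()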
Running the example, we can see the filled contour plot as before with the optima marked. We can now see the sample drawn as black dots, and their surrounding color and relative distance to the optima give an idea of how close the algorithm (random points in this case) got to solving the problem.
Figure 7.8: Filled Contour Plot of a Two-Dimensional Objective Function With Optima
and Input Sample Marked
...
fig, ax = pyplot.subplots(subplot_kw={"projection": "3d"})
ax.plot_surface(x, y, results, cmap='jet')
Program 7.30: Create a surface plot with the jet color scheme
# objective function
def objective(x, y):
return x**2.0 + y**2.0
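A minimal sketch of the complete surface plot program, assembled from the snippets above, might look like this:

from numpy import arange, meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# sample the input domain on a uniform grid
r_min, r_max = -5.0, 5.0
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
x, y = meshgrid(xaxis, yaxis)
results = objective(x, y)
# create a surface plot with the jet color scheme
fig, ax = pyplot.subplots(subplot_kw={"projection": "3d"})
ax.plot_surface(x, y, results, cmap='jet')
pyplot.show()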
Running the example creates a three-dimensional surface plot of the objective function.
Additionally, the plot is interactive, meaning that you can use the mouse to drag the
perspective on the surface around and view it from different angles.
APIs
Optimization and root finding (scipy.optimize).
https://docs.scipy.org/doc/scipy/reference/optimize.html
Optimization (scipy.optimize).
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
numpy.meshgrid API.
https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html
matplotlib.pyplot.contour API.
https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.contour.html
matplotlib.pyplot.contourf API.
https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.contourf.html
mpl_toolkits.mplot3d.Axes3D.plot_surface API.
https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html#mpl_toolkits.mplot3d.Axes3D.plot_surface
Articles
Mathematical optimization. Wikipedia.
https://en.wikipedia.org/wiki/Mathematical_optimization
Parabola. Wikipedia.
https://en.wikipedia.org/wiki/Parabola
7.6 Summary
In this tutorial, you discovered how to create visualizations for function optimization in Python.
Specifically, you learned:
▷ Visualization is an important tool when studying function optimization algorithms.
▷ How to visualize one-dimensional functions and samples using line plots.
▷ How to visualize two-dimensional functions and samples using contour and surface plots.
Next, you will learn about the use of randomness in optimization.
8 Stochastic Optimization Algorithms
Stochastic optimization refers to the use of randomness in the objective function or in the optimization algorithm. Challenging optimization problems, such as high-dimensional nonlinear objective functions, may contain multiple local optima in which deterministic optimization algorithms may get stuck. Stochastic optimization algorithms provide an alternative approach that permits less optimal local decisions to be made within the search procedure, which may increase the probability of the procedure locating the global optima of the objective function.
In this tutorial, you will discover a gentle introduction to stochastic optimization. After
completing this tutorial, you will know:
▷ Stochastic optimization algorithms make use of randomness as part of the search
procedure.
▷ Examples of stochastic optimization algorithms like simulated annealing and genetic
algorithms.
▷ Practical considerations when using stochastic optimization algorithms such as repeated
evaluations.
Let's get started.
"... noise in the measurements provided to the algorithm and/or there is injected (Monte Carlo) randomness in the algorithm itself."
-- Page xiii, Introduction to Stochastic Search and Optimization, 2003.
Randomness in the objective function means that the evaluation of candidate solutions involves some uncertainty or noise, and algorithms must be chosen that can make progress in the search in the presence of this noise. Randomness in the algorithm is used as a strategy, e.g. stochastic or probabilistic decisions. It is used as an alternative to deterministic decisions in an effort to improve the likelihood of locating the global optima or a better local optima.
"Standard stochastic optimization methods are brittle, sensitive to stepsize choice and ..."
-- "The importance of better models in stochastic optimization", 2019.
"... noise and coping with models or systems that are highly nonlinear, high dimensional, or otherwise inappropriate for classical deterministic methods of optimization."
-- "Stochastic Optimization", 2011.
Using randomness in an optimization algorithm allows the search procedure to perform well on challenging optimization problems that may have a nonlinear response surface. This is achieved by the algorithm taking locally suboptimal steps or moves in the search space that allow it to escape local optima.
"Randomness can help escape local optima and increase the chances of finding a global optimum."
-- Page 8, Algorithms for Optimization, 2019.
The randomness used in a stochastic optimization algorithm does not have to be true randomness; instead, pseudorandomness is sufficient. A pseudorandom number generator is almost universally used in stochastic optimization. Use of randomness in a stochastic optimization algorithm does not mean that the algorithm is random. Instead, it means that some decisions made during the search procedure involve some portion of randomness. For example, we can conceptualize this as the algorithm choosing its move from the current point to the next point in the search space according to a probability distribution relative to the optimal move.
Now that we have an idea of what stochastic optimization is, let's look at some examples of stochastic optimization algorithms.
Running a stochastic optimization algorithm multiple times can be useful in two key situations:
▷ Comparing Algorithms
▷ Evaluating Final Result
Algorithms may be compared based on the relative quality of the result found, the number
of function evaluations performed, or some combination or derivation of these considerations.
The result of any one run will depend upon the randomness used by the algorithm and alone
cannot meaningfully represent the capability of the algorithm. Instead, a strategy of repeated
evaluation should be used. Any comparison between stochastic optimization algorithms will
require the repeated evaluation of each algorithm with a different source of randomness and
the summarization of the probability distribution of best results found, such as the mean and
standard deviation of objective values. The mean result from each algorithm can then be
compared.
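For example, a minimal sketch of this repeated-evaluation procedure, using a simple random search as a stand-in stochastic algorithm (the objective function, bounds, sample size, number of runs, and seeds here are all assumptions), might look like the following:

from numpy import mean, std
from numpy.random import seed, rand

# objective function
def objective(x):
    return x**2.0

# one run of a simple stochastic algorithm (a random search stand-in)
def one_run(run_seed, n_evals=100):
    seed(run_seed)
    sample = -5.0 + rand(n_evals) * 10.0
    return min(objective(x) for x in sample)

# repeat the run with a different source of randomness each time
best_scores = [one_run(s) for s in range(30)]
# summarize the distribution of best results found
print('Mean: %.5f, Standard Deviation: %.5f' % (mean(best_scores), std(best_scores)))

The mean and standard deviation of the best scores, rather than any single run, would then be used to compare this algorithm against another.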
"In cases where multiple local minima are likely to exist, it can be beneficial to incorporate random restarts after our termination conditions are met where we restart our local descent method from randomly selected initial points."
-- Page 66, Algorithms for Optimization, 2019.
Similarly, any single run of a chosen optimization algorithm alone does not meaningfully represent the global optima of the objective function. Instead, a strategy of repeated evaluation should be used to develop a distribution of optimal solutions. The maximum or minimum of the distribution can be taken as the final solution, and the distribution itself will provide a point of reference and confidence that the solution found is "relatively good" or "good enough" given the resources expended.
▷ Multi-Restart: An approach for improving the likelihood of locating the global optima
via the repeated application of a stochastic optimization algorithm to an optimization
problem.
The repeated application of a stochastic optimization algorithm to an objective function is sometimes referred to as a multi-restart strategy, and it may be built into the optimization algorithm itself or prescribed more generally as a procedure around the chosen stochastic optimization algorithm.
"Each time you do a random restart, the hill-climber then winds up in some (possibly ..."
Papers
Hilal Asi and John C. Duchi. "The importance of better models in stochastic optimization". In: Proceedings of the National Academy of Sciences. Vol. 116. 46. 2019, pp. 22924-22930.
https://www.pnas.org/content/116/46/22924
Books
James C. Spall. Introduction to Stochastic Search and Optimization. Wiley-Interscience, 2003.
https://amzn.to/34JYN7m
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Sean Luke. Essentials of Metaheuristics. lulu.com, 2011.
https://amzn.to/3lHryZr
Articles
Stochastic optimization. Wikipedia.
https://en.wikipedia.org/wiki/Stochastic_optimization
Heuristic (computer science). Wikipedia.
https://en.wikipedia.org/wiki/Heuristic_(computer_science)
Metaheuristic. Wikipedia.
https://en.wikipedia.org/wiki/Metaheuristic
8.6 Summary
In this tutorial, you discovered a gentle introduction to stochastic optimization. Specifically,
you learned:
▷ Stochastic optimization algorithms make use of randomness as part of the search
procedure.
▷ Examples of stochastic optimization algorithms like simulated annealing and genetic
algorithms.
▷ Practical considerations when using stochastic optimization algorithms such as repeated
evaluations.
Next, you will learn about two naive optimization algorithms, the random search and grid
search.
9 Random Search and Grid Search
Function optimization requires the selection of an algorithm to efficiently sample the search
space and locate a good or best solution. There are many algorithms to choose from, although
it is important to establish a baseline for what types of solutions are feasible or possible for a
problem. This can be achieved using a naive optimization algorithm, such as a random search
or a grid search.
The results achieved by a naive optimization algorithm are computationally efficient to
generate and provide a point of comparison for more sophisticated optimization algorithms.
Sometimes, naive algorithms are found to achieve the best performance, particularly on those
problems that are noisy or non-smooth and those problems where domain expertise typically
biases the choice of optimization algorithm.
In this tutorial, you will discover naive algorithms for function optimization. After
completing this tutorial, you will know:
▷ The role of naive algorithms in function optimization projects.
▷ How to generate and evaluate a random search for function optimization.
▷ How to generate and evaluate a grid search for function optimization.
Let's get started.
The example below defines a simple one-dimensional minimization objective function, then generates and evaluates a random sample of 100 inputs. The input with the best performance is then reported.
# objective function
def objective(x):
return x**2.0
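A minimal sketch of such a random search program (the input range here is an assumption) might look like the following:

from numpy.random import rand

# objective function
def objective(x):
    return x**2.0

# define the range for input
r_min, r_max = -5.0, 5.0
# generate a random sample of 100 points from the domain
sample = r_min + rand(100) * (r_max - r_min)
# evaluate the sample
sample_eval = objective(sample)
# locate the best solution
best_ix = 0
for i in range(len(sample)):
    if sample_eval[i] < sample_eval[best_ix]:
        best_ix = i
# summarize the best solution
print('Best: f(%.5f) = %.5f' % (sample[best_ix], sample_eval[best_ix]))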
Running the example generates a random sample of input values, which are then evaluated. The best performing point is then identified and reported.
Note: Your results may vary given the stochastic nature of the algorithm or
. evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the result is very close to the optimal input of 0.0.
We can update the example to plot the objective function and show the sample and best result.
The complete example is listed below.
# objective function
def objective(x):
return x**2.0
Running the example again generates the random sample and reports the best result.
A line plot is then created showing the shape of the objective function, the random sample,
and a red line for the best result located from the sample.
Figure 9.1: Line Plot of One-Dimensional Objective Function With Random Sample
"... space. This approach is easy to implement, does not rely on randomness, and covers the space, but it uses a large number of points."
-- Page 235, Algorithms for Optimization, 2019.
Like random search, a grid search can be particularly effective on problems where domain expertise is typically used to influence the selection of specific optimization algorithms. The grid can help to quickly identify areas of a search space that may deserve more attention.
The grid of samples is typically uniform, although this does not have to be the case. For
example, a log-10 scale could be used with a uniform spacing, allowing sampling to be performed
across orders of magnitude. The downside is that the coarseness of the grid may step over
whole regions of the search space where good solutions reside, a problem that gets worse as the
number of inputs (dimensions of the search space) to the problem increases.
A grid of samples can be generated by choosing the uniform separation of points, then
enumerating each variable in turn and incrementing each variable by the chosen separation. The
example below gives an example of a simple two-dimensional minimization objective function
and generates then evaluates a grid sample with a spacing of 0.1 for both input variables. The
input with the best performance is then reported.
# objective function
def objective(x, y):
return x**2.0 + y**2.0
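A minimal sketch of such a grid search program (the input range here is an assumption) might look like this:

from numpy import arange

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define the range for input
r_min, r_max = -5.0, 5.0
# generate a grid sample from the domain with a spacing of 0.1
sample = []
for x in arange(r_min, r_max + 0.1, 0.1):
    for y in arange(r_min, r_max + 0.1, 0.1):
        sample.append([x, y])
# evaluate the sample
sample_eval = [objective(x, y) for x, y in sample]
# locate the best solution
best_ix = 0
for i in range(len(sample)):
    if sample_eval[i] < sample_eval[best_ix]:
        best_ix = i
# summarize the best solution
print('Best: f(%.5f, %.5f) = %.5f' % (sample[best_ix][0], sample[best_ix][1], sample_eval[best_ix]))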
Running the example generates a grid of input values, which are then evaluated. The best performing point is then identified and reported.
Note: Your results may vary given the stochastic nature of the algorithm or
. evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the result finds the optima exactly.
We can update the example to plot the objective function and show the sample and best
result. The complete example is listed below.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
Running the example again generates the grid sample and reports the best result.
A contour plot is then created showing the shape of the objective function, the grid sample as black dots, and a white star for the best result located from the sample. Note that some of the black dots at the edge of the domain appear to be off the plot; this is just an artifact of how we are choosing to draw the dots (e.g. not centered on the sample).
Figure 9.2: Contour Plot of a Two-Dimensional Objective Function With Grid Sample
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Articles
Random search. Wikipedia.
https://en.wikipedia.org/wiki/Random_search
Hyperparameter optimization. Wikipedia.
https://en.wikipedia.org/wiki/Hyperparameter_optimization
Brute-force search. Wikipedia.
https://en.wikipedia.org/wiki/Brute-force_search
9.6 Summary
In this tutorial, you discovered naive algorithms for function optimization. Specifically, you
learned:
▷ The role of naive algorithms in function optimization projects.
▷ How to generate and evaluate a random search for function optimization.
▷ How to generate and evaluate a grid search for function optimization.
The derivative tells us how much the function changes at the point x. It might change a lot, i.e. be very curved, or might change a little, i.e. a slight curve, or it might not change at all, i.e. flat or stationary.
A function is differentiable if we can calculate the derivative at all points of input for the function variables. Not all functions are differentiable. Once we calculate the derivative, we can use it in a number of ways. For example, given an input value x and the derivative at that point f′(x), we can estimate the value of the function at a nearby point x + ∆x (where ∆x is a small change in x) using the derivative, as follows:
f(x + ∆x) ≈ f(x) + f′(x) × ∆x
Here, we can see that f′(x) describes a line through the point, and we are estimating the value of the function at the nearby point by moving along that line by ∆x. We can use derivatives in optimization problems as they tell us how to change inputs to the target function in a way that increases or decreases the output of the function, so we can get closer to the minimum or maximum of the function.
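For example, with f(x) = x², so that f′(x) = 2x, we can check this estimate at a hypothetical point in a few lines of Python (the point and step size below are illustrative assumptions):

# first-order estimate of a nearby function value: f(x + dx) ~ f(x) + f'(x) * dx
def objective(x):
    return x**2.0

def derivative(x):
    return 2.0 * x

# hypothetical point and step size
x, dx = 2.0, 0.1
estimate = objective(x) + derivative(x) * dx
print('estimate f(%.1f) = %.3f, true f(%.1f) = %.3f' % (x + dx, estimate, x + dx, objective(x + dx)))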
Derivatives are useful in optimization because they provide information about how the function is changing at a given point. Formally, the derivative is defined as a special type of limit:
"... type of limit [...] This special type of limit is called a derivative and we will see that it can be interpreted as a rate of change in any of the sciences or engineering."
-- Page 104, Calculus, 2013.
An example of the tangent line of a point for a function is provided below, taken from page 19 of "Algorithms for Optimization."
Figure 10.1: Tangent Line of a Function at a Given Point. Taken from Algorithms for
Optimization.
Technically, the derivative described so far is called the first derivative or first-order derivative. The second derivative (or second-order derivative) is the derivative of the derivative function. That is, the rate of change of the rate of change, or how much the change in the function changes.
▷ First Derivative: Rate of change of the target function.
▷ Second Derivative: Rate of change of the first derivative function.
A natural use of the second derivative is to approximate the first derivative at a nearby point, just as we can use the first derivative to estimate the value of the target function at a nearby point.
Now that we know what a derivative is, let's take a look at a gradient.
"The gradient is the generalization of the derivative to multivariate functions. It captures the local slope of the function, allowing us to predict the effect of taking a small step from a point in any direction."
-- Page 21, Algorithms for Optimization, 2019.
Multiple input variables together define a vector of values, i.e. a point in the input space that can be provided to the target function. The derivative of a target function with a vector of input variables similarly is a vector. This vector of derivatives for each input variable is the gradient.
▷ Gradient (vector calculus): A vector of derivatives for a function that takes a vector of
input variables.
You might recall from high school algebra or pre-calculus that the gradient also refers generally to the slope of a line on a two-dimensional plot. It is calculated as the rise (change on the y-axis) of the function divided by the run (change on the x-axis) of the function, simplified to the rule "rise over run":
▷ Gradient (algebra): Slope of a line, calculated as rise over run.
We can see that this is a simple and rough approximation of the derivative for a function with one variable. The derivative function from calculus is more precise as it uses limits to find the exact slope of the function at a point. This idea of gradient from algebra is related, but not directly useful to the idea of a gradient as used in optimization and machine learning. A function that takes multiple input variables, i.e. a vector of input variables, may be referred to as a multivariate function.
# objective function
def objective(x):
return x**2.0
Running the example creates a line plot of the inputs to the function (x-axis) and the calculated output of the function (y-axis). We can see the familiar U-shape called a parabola.
We can see a large change or steep curve on the sides of the shape where we would expect a large derivative, and a flat area in the middle of the function where we would expect a small derivative.
Let's confirm these expectations by calculating the derivative at −0.5 and 0.5 (steep) and 0.0 (flat). The derivative for the function is calculated as follows:
f ′ (x) = 2 × x
The example below calculates the derivatives for the specific input points for our objective function.
# derivative of the objective function
def derivative(x):
    return x * 2.0

# calculate derivatives
d1 = derivative(-0.5)
print("f'(-0.5) = %.3f" % d1)
d2 = derivative(0.5)
print("f'(0.5) = %.3f" % d2)
d3 = derivative(0.0)
print("f'(0.0) = %.3f" % d3)
Program 10.2: Calculate the derivative of the objective function
Running the example prints the derivative values for specific input values. We can see that the derivative at the steep points of the function is −1 and 1, and the derivative for the flat part of the function is 0.0.
f'(-0.5) = -1.000
f'(0.5) = 1.000
f'(0.0) = 0.000
Output 10.1: Result from Program 10.2
Now that we know how to calculate derivatives of a function, let's look at how we might interpret the derivative values.
Not all functions are differentiable, and some functions that are differentiable may make it difficult to find the derivative with some methods. Calculating the derivative of a function is beyond the scope of this tutorial. Consult a good calculus textbook, such as those in the further reading section.
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
James Stewart. Calculus. 8th ed. Cengage Learning, 2013.
https://amzn.to/3kS9I52
Gilbert Strang. Calculus. 3rd ed. Wellesley-Cambridge Press, 2017.
https://amzn.to/3fqNSEB
(The 1991 edition is available online: https://ocw.mit.edu/resources/res-18-001-calculus-online-textbook-spring-2005/textbook/).
Articles
Derivative. Wikipedia.
https://en.wikipedia.org/wiki/Derivative
Second derivative. Wikipedia.
https://en.wikipedia.org/wiki/Second_derivative
Partial derivative. Wikipedia.
https://en.wikipedia.org/wiki/Partial_derivative
Gradient. Wikipedia.
https://en.wikipedia.org/wiki/Gradient
Differentiable function. Wikipedia.
https://en.wikipedia.org/wiki/Differentiable_function
Jacobian matrix and determinant. Wikipedia.
https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
Hessian matrix. Wikipedia.
https://en.wikipedia.org/wiki/Hessian_matrix
10.8 Summary
In this tutorial, you discovered a gentle introduction to the derivative and the gradient in machine learning. Specifically, you learned:
▷ The derivative of a function is the change of the function for a given input.
▷ The gradient is simply a derivative vector for a multivariate function.
▷ How to calculate and interpret derivatives of a simple function.
Next, we will start with the simplest example of optimizing a function with one variable.
11 Univariate Function Optimization
How to Optimize a Function with One Variable?
Univariate function optimization involves finding the input to a function that results in the optimal output from an objective function. This is a common procedure in machine learning when fitting a model with one parameter or tuning a model that has a single hyperparameter. An efficient algorithm is required to solve optimization problems of this type that will find the best solution with the minimum number of evaluations of the objective function, given that each evaluation of the objective function could be computationally expensive, such as fitting and evaluating a model on a dataset. This excludes expensive grid search and random search algorithms in favor of efficient algorithms like Brent's method.
In this tutorial, you will discover how to perform univariate function optimization in Python.
After completing this tutorial, you will know:
▷ Univariate function optimization involves finding an optimal input for an objective function that takes a single continuous argument.
▷ How to perform univariate function optimization for an unconstrained convex function.
▷ How to perform univariate function optimization for an unconstrained non-convex function.
Let's get started.
"...-finding algorithm that combines elements of the secant method and inverse quadratic interpolation. It has reliable and fast convergence properties, and it is the univariate optimization algorithm of choice in many popular numerical optimization packages."
-- Pages 49-51, Algorithms for Optimization, 2019.
Bisecting algorithms use a bracket (lower and upper) of input values and split up the input domain, bisecting it in order to locate where in the domain the optima is located, much like a binary search. Dekker's method is one way this is achieved efficiently for a continuous domain. Dekker's method gets stuck on non-convex problems.
Brent's method modifies Dekker's method to avoid getting stuck and also approximates the second derivative of the objective function (an approach called the secant method) in an effort to accelerate the search. As such, Brent's method for univariate function optimization is generally preferred over most other univariate function optimization algorithms given its efficiency.
Brent's method is available in Python via the minimize_scalar() SciPy function that takes the name of the function to be minimized. If your target function is constrained to a range, it can be specified via the "bounds" argument. It returns an OptimizeResult object that is a dictionary containing the solution. Importantly, the 'x' key summarizes the input for the optima, the 'fun' key summarizes the function output for the optima, and the 'nfev' key summarizes the number of evaluations of the target function that were performed.
...
result = minimize_scalar(objective, method='brent')
Program 11.1: Minimize the objective function
Now that we know how to perform univariate function optimization in Python, let's look at some examples.
def objective(x):
return (5.0 + x)**2.0
Program 11.2: Objective function
We can plot a coarse grid of this function with input values from −10 to 10 to get an idea
of the shape of the target function. The complete example is listed below.
from numpy import arange
from matplotlib import pyplot
# objective function
def objective(x):
    return (5.0 + x)**2.0
# define range and prepare inputs
r_min, r_max = -10.0, 10.0
inputs = arange(r_min, r_max, 0.1)
# compute targets and plot inputs vs targets
targets = [objective(x) for x in inputs]
pyplot.plot(inputs, targets, '--')
pyplot.show()
Running the example evaluates input values in our specified range using our target function and creates a plot of the function inputs to function outputs. We can see the U-shape of the function and that the optima is at −5.0.
...
result = minimize_scalar(objective, method='brent')
Program 11.4: Minimize the function
Once optimized, we can summarize the result, including the input and evaluation of the
optima and the number of function evaluations required to locate the optima.
...
opt_x, opt_y = result['x'], result['fun']
print('Optimal Input x: %.6f' % opt_x)
print('Optimal Output f(x): %.6f' % opt_y)
print('Total Evaluations n: %d' % result['nfev'])
Program 11.5: Summarize the result
Finally, we can plot the function again and mark the optima to confirm it was located in the place we expected for this function.
...
# define the range
r_min, r_max = -10.0, 10.0
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
# plot the optima
pyplot.plot([opt_x], [opt_y], 's', color='r')
# show the plot
pyplot.show()
Program 11.6: Plot the function and mark the optima
# objective function
def objective(x):
return (5.0 + x)**2.0
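A minimal sketch of the complete program, assembled from the snippets above, might look like this:

from scipy.optimize import minimize_scalar

# objective function
def objective(x):
    return (5.0 + x)**2.0

# minimize the function with Brent's method
result = minimize_scalar(objective, method='brent')
# summarize the result
print('Optimal Input x: %.6f' % result['x'])
print('Optimal Output f(x): %.6f' % result['fun'])
print('Total Evaluations n: %d' % result['nfev'])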
Running the example first solves the optimization problem and reports the result.
Note: Your results may vary given the stochastic nature of the algorithm or
. evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the optima was located after 10 evaluations of the objective
function with an input of −5.0, achieving an objective function value of 0.0.
A plot of the function is created again and this time, the optima is marked as a red square.
Figure 11.2: Line Plot of a Convex Objective Function with Optima Marked
A non-convex function can have multiple hills and valleys that can cause the search to get stuck and report a false or local optima instead. We can define a non-convex univariate function as follows.
def objective(x):
return (x - 2.0) * x * (x + 2.0)**2.0
Program 11.8: Objective function
We can sample this function and create a line plot of input values to objective values. The
complete example is listed below.
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
    return (x - 2.0) * x * (x + 2.0)**2.0

# define range
r_min, r_max = -3.0, 2.5
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
pyplot.show()
Program 11.9: Plot a non-convex univariate function
Running the example evaluates input values in our specified range using our target function and creates a plot of the function inputs to function outputs. We can see a function with one false optima around −2.0 and a global optima around 1.2.
Next, we can use the optimization algorithm to find the optima. As before, we can call the minimize_scalar() function to optimize the function, then summarize the result and plot the optima on a line plot.
The complete example of optimization of an unconstrained non-convex univariate function
is listed below.
# objective function
def objective(x):
return (x - 2.0) * x * (x + 2.0)**2.0
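A minimal sketch of the complete program for the non-convex case might look like this:

from scipy.optimize import minimize_scalar

# non-convex objective function
def objective(x):
    return (x - 2.0) * x * (x + 2.0)**2.0

# minimize the function with Brent's method
result = minimize_scalar(objective, method='brent')
# summarize the result
print('Optimal Input x: %.6f' % result['x'])
print('Optimal Output f(x): %.6f' % result['fun'])
print('Total Evaluations n: %d' % result['nfev'])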
Running the example first solves the optimization problem and reports the result. In this case, we can see that the optima was located after 15 evaluations of the objective function with an input of about 1.28, achieving an objective function value of about −9.91.
A plot of the function is created again, and this time, the optima is marked as a red square.
We can see that the optimization was not deceived by the false optima and successfully located
the global optima.
Figure 11.4: Line Plot of a Non-Convex Objective Function with Optima Marked
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
APIs
Optimization (scipy.optimize).
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
Optimization and root finding (scipy.optimize).
https://docs.scipy.org/doc/scipy/reference/optimize.html
scipy.optimize.minimize_scalar API.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize_scalar.html
Articles
Brent's method. Wikipedia.
https://en.wikipedia.org/wiki/Brent%27s_method
11.6 Summary
In this tutorial, you discovered how to perform univariate function optimization in Python. Specifically, you learned:
▷ Univariate function optimization involves finding an optimal input for an objective function that takes a single continuous argument.
▷ How to perform univariate function optimization for an unconstrained convex function.
▷ How to perform univariate function optimization for an unconstrained non-convex
function.
Next, you will be introduced to an optimization algorithm that does not need to use a gradient.
12 Pattern Search: The Nelder-Mead Optimization Algorithm
The Nelder-Mead optimization algorithm is a widely used approach for non-differentiable objective functions. As such, it is generally referred to as a pattern search algorithm and is used as a local or global search procedure for challenging nonlinear and potentially noisy and multimodal function optimization problems.
In this tutorial, you will discover the Nelder-Mead optimization algorithm. After completing
this tutorial, you will know:
▷ The Nelder-Mead optimization algorithm is a type of pattern search that does not use
function gradients.
▷ How to apply the Nelder-Mead algorithm for function optimization in Python.
▷ How to interpret the results of the Nelder-Mead algorithm on noisy and multimodal
objective functions.
Let's get started.
"... stagnation has been observed to occur at nonoptimal points. Restarting can be used when stagnation is detected."
-- Page 239, Numerical Optimization, 2006.
A starting point must be provided to the algorithm, which may be the endpoint of another global optimization algorithm or a random point drawn from the domain. Given that the algorithm may get stuck, it may benefit from multiple restarts with different starting points.
"The Nelder-Mead simplex method uses a simplex to traverse the space in search of a minimum."
-- Page 105, Algorithms for Optimization, 2019.
The algorithm works by using a shape structure (called a simplex) composed of n + 1
points (vertices), where n is the number of input dimensions to the function. For example, on
a two-dimensional problem that may be plotted as a surface, the shape structure would be
composed of three points represented as a triangle.
"The Nelder-Mead method uses a series of rules that dictate how the simplex is [...] with the worst function value and replace it with another point with a better value. The new point is obtained by reflecting, expanding, or contracting the simplex along the line joining the worst vertex with the centroid of the remaining vertices. If we
cannot find a better point in this manner, we retain only the vertex with the best function value, and we shrink the simplex by moving all other vertices toward this value."
-- Page 238, Numerical Optimization, 2006.
The search stops when the points converge on an optimum, when a minimum difference between evaluations is observed, or when a maximum number of function evaluations are performed. Now that we have a high-level idea of how the algorithm works, let's look at how we might use it in practice.
...
result = minimize(objective, pt, method='nelder-mead')
Program 12.1: Perform the search
The result is an OptimizeResult object that contains information about the result of the optimization accessible via keys. For example, the "success" boolean indicates whether the search was completed successfully or not, the "message" provides a human-readable message about the success or failure of the search, and the "nfev" key indicates the number of function evaluations that were performed. Importantly, the "x" key specifies the input values that indicate the optima found by the search, if successful.
...
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
print('Solution: %s' % result['x'])
Program 12.2: Summarize the result
def objective(x):
return x[0]**2.0 + x[1]**2.0
Program 12.3: Objective function
We will use a random point in the defined domain as a starting point for the search.
...
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
Program 12.4: Define range for input
The search can then be performed. We use the default maximum number of function evaluations, set via the "maxiter" argument to N*200, where N is the number of input variables, which is two in this case, i.e. 400 evaluations.
...
result = minimize(objective, pt, method='nelder-mead')
Program 12.5: Perform the search
After the search is finished, we will report the total function evaluations used to find the optima and the success message of the search, which we expect to be positive in this case.
...
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
Program 12.6: Summarize the result
Finally, we will retrieve the input values for located optima, evaluate it using the objective
function, and report both in a human-readable manner.
...
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
Program 12.7: Evaluate solution
Tying this together, the complete example of using the Nelder-Mead optimization algorithm on
a simple convex objective function is listed below.
# objective function
def objective(x):
return x[0]**2.0 + x[1]**2.0
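A minimal sketch of the complete Nelder-Mead program, assembled from the snippets above, might look like this:

from numpy.random import rand
from scipy.optimize import minimize

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# define the range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the search
result = minimize(objective, pt, method='nelder-mead')
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate the solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))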
Running the example executes the optimization, then reports the results.
Note: Your results may vary given the stochastic nature of the algorithm or
. evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the search was successful, as we expected, and was completed
after 88 function evaluations. We can see that the optima was located with inputs very close to
[0, 0], which evaluates to the minimum objective value of 0.0.
Now that we have seen how to use the Nelder-Mead optimization algorithm successfully, let's look at some examples where it does not perform so well.
def objective(x):
return (x + randn(len(x))*0.3)**2.0
Program 12.9: Objective function
The noise will make the function challenging to optimize for the algorithm and it will very likely
not locate the optima at x = 0.0. The complete example of using Nelder-Mead to optimize the
noisy objective function is listed below.
# objective function
def objective(x):
return (x + randn(len(x))*0.3)**2.0
Running the example executes the optimization, then reports the results.
Note: Your results may vary given the stochastic nature of the algorithm or
. evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, the algorithm does not converge and instead uses the maximum number of
function evaluations, which is 200.
The algorithm may converge on some runs of the code but will arrive on a point away from the
optima.
A multimodal function may have multiple global optima with an equivalent function evaluation, or a single global optima and multiple local optima where algorithms like the Nelder-Mead can get stuck in search of the local optima.
The Ackley function is an example of the latter. It is a two-dimensional objective function that has a global optima at [0, 0] but has many local optima. The example below implements the Ackley function and creates a three-dimensional plot showing the global optima and multiple local optima.
# objective function
def objective(x, y):
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20
Running the example creates the surface plot of the Ackley function showing the vast number
of local optima.
We would expect the Nelder-Mead function to get stuck in one of the local optima while in
search of the global optima. Initially, when the simplex is large, the algorithm may jump over
many local optima, but as it contracts, it will get stuck. We can explore this with the example
below that demonstrates the Nelder-Mead algorithm on the Ackley function.
# objective function
def objective(v):
    x, y = v
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20
Running the example executes the optimization, then reports the results.
Note: Your results may vary given the stochastic nature of the algorithm or
. evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the search completed successfully but did not locate the global optima. It got stuck and found a local optima. Each time we run the example, we will find a different local optima given the different random starting point for the search.
Papers
J. A. Nelder and R. Mead. "A Simplex Method for Function Minimization". The Computer Journal, 7(4), 1965, pp. 308-313.
https://academic.oup.com/comjnl/article-abstract/7/4/308/354237
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Jorge Nocedal and Stephen Wright. Numerical Optimization. 2nd ed. Springer, 2006.
https://amzn.to/3sbjF2t
APIs
Nelder-Mead Simplex algorithm (method='Nelder-Mead').
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#nelder-mead-simplex-algorithm-method-nelder-mead
scipy.optimize.minimize API.
https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html
scipy.optimize.OptimizeResult API.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.OptimizeResult.html
numpy.random.randn API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html
Articles
Nelder-Mead method. Wikipedia.
https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
Nelder-Mead algorithm. Scholarpedia.
http://www.scholarpedia.org/article/Nelder-Mead_algorithm
12.6 Summary
In this tutorial, you discovered the Nelder-Mead optimization algorithm. Specifically, you learned:
▷ The Nelder-Mead optimization algorithm is a type of pattern search that does not use
function gradients.
▷ How to apply the Nelder-Mead algorithm for function optimization in Python.
▷ How to interpret the results of the Nelder-Mead algorithm on noisy and multimodal
objective functions.
Next, you will learn about an algorithm that uses second-order derivatives.
13 Second Order: The BFGS and L-BFGS-B Optimization Algorithms
The Broyden, Fletcher, Goldfarb, and Shanno, or BFGS Algorithm, is a local search optimization
algorithm. It is a type of second-order optimization algorithm, meaning that it makes use of the
second-order derivative of an objective function and belongs to a class of algorithms referred
to as Quasi-Newton methods that approximate the second derivative (called the Hessian) for
optimization problems where the second derivative cannot be calculated. The BFGS algorithm
is perhaps one of the most widely used second-order algorithms for numerical optimization and
is commonly used to fit machine learning algorithms such as the logistic regression algorithm.
In this tutorial, you will discover the BFGS second-order optimization algorithm. After
completing this tutorial, you will know:
▷ Second-order optimization algorithms are algorithms that make use of the second-order
derivative, called the Hessian matrix for multivariate objective functions.
▷ The BFGS algorithm is perhaps the most popular second-order algorithm for numerical
optimization and belongs to a group called Quasi-Newton methods.
▷ How to minimize objective functions using the BFGS and L-BFGS-B algorithms in
Python.
Let's get started.
Second-order optimization algorithms make use of the second derivative of the objective function. You may recall from calculus that the first derivative of a function is the rate of change of the function at a specific point. The derivative can be followed downhill (or uphill) by an optimization algorithm toward the minima of the function (the input values that result in the smallest output of the objective function).
Algorithms that make use of the first derivative are called first-order optimization algorithms. An example of a first-order algorithm is the gradient descent optimization algorithm.
▷ First-Order Methods: Optimization algorithms that make use of the first-order derivative to find the optima of an objective function.
The second-order derivative is the derivative of the derivative, or the rate of change of the rate of change. The second derivative can be followed to more efficiently locate the optima of the objective function. This makes sense more generally, as the more information we have about the objective function, the easier it may be to optimize it. The second-order derivative allows us to know which direction to move in (like the first-order derivative) and also to estimate how far to move in that direction, called the step size.
"Second-order information, on the other hand, allows us to make a quadratic approximation of the objective function and approximate the right step size to reach a local minimum..."
-- Page 87, Algorithms for Optimization, 2019.
Algorithms that make use of the second-order derivative are referred to as second-order optimization algorithms.
▷ Second-Order Methods: Optimization algorithms that make use of the second-order derivative to find the optima of an objective function.
An example of a second-order optimization algorithm is Newton's method. When an objective function has more than one input variable, the input variables together may be thought of as a vector, which may be familiar from linear algebra.
"The gradient is the generalization of the derivative to multivariate functions. It captures the local slope of the function, allowing us to predict the effect of taking a small step from a point in any direction."
-- Page 21, Algorithms for Optimization, 2019.
Similarly, the first derivative of multiple input variables may also be a vector, where each element is called a partial derivative. This vector of partial derivatives is referred to as the gradient.
▷ Gradient: Vector of partial first derivatives for multiple input variables of an objective function.
This idea generalizes to the second-order derivatives of the multivariate inputs, which is a matrix
containing the second derivatives called the Hessian matrix.
▷ Hessian: Matrix of partial second-order derivatives for multiple input variables of an
objective function.
The Hessian matrix is square and symmetric if the second derivatives are all continuous at the
point where we are calculating the derivatives. This is often the case when solving real-valued
optimization problems and an expectation when using many second-order methods.
"The Hessian of a multivariate function is a matrix containing all of the second derivatives with respect to the input. The second derivatives capture information about the local curvature of the function."
-- Page 21, Algorithms for Optimization, 2019.
As such, it is common to describe second-order optimization algorithms as making use of or following the Hessian to the optima of the objective function. Now that we have a high-level understanding of second-order optimization algorithms, let's take a closer look at the BFGS algorithm.
"... optimization. They are incorporated in many software libraries, and they are effective in solving a wide variety of small to midsize problems, in particular when the Hessian is hard to compute."
-- Page 411, Linear and Nonlinear Optimization, 2009.
The main difference between different Quasi-Newton optimization algorithms is the specific way in which the approximation of the inverse Hessian is calculated. The BFGS algorithm is one specific way of updating the calculation of the inverse Hessian, instead of recalculating it every iteration. It, or its extensions, may be one of the most popular Quasi-Newton or even second-order optimization algorithms used for numerical optimization.
"The most popular quasi-Newton algorithm is the BFGS method, named for its ..."
-- Numerical Optimization, 2006.
A quasi-Newton method maintains an approximation of the inverse Hessian, which can then be used to determine the direction to move, but we no longer have the step size.
The BFGS algorithm addresses this by using a line search in the chosen direction to
determine how far to move in that direction. For the derivation and calculations used by
the BFGS algorithm, I recommend the resources in the further reading section at the end of
this tutorial. The size of the Hessian and its inverse is proportional to the number of input
parameters to the objective function. As such, the size of the matrix can become very large for
hundreds, thousands, or millions of parameters.
"... the BFGS algorithm must store the inverse Hessian matrix, M, that requires O(n²) memory, making BFGS impractical for most modern deep learning models that typically have millions of parameters."
-- Page 317, Deep Learning, 2016.
Limited Memory BFGS (or L-BFGS) is an extension to the BFGS algorithm that addresses the cost of having a large number of parameters. It does this by not requiring that the entire approximation of the inverse matrix be stored, by assuming a simplification of the inverse Hessian in the previous iteration of the algorithm (used in the approximation).
Now that we are familiar with the BFGS algorithm from a high level, let's look at how we might make use of it.
...
result = minimize(objective, pt, method='BFGS', jac=derivative)
Program 13.1: Perform the BFGS algorithm search
def objective(x):
return x[0]**2.0 + x[1]**2.0
Program 13.2: Objective function
Next, let's define a function for the derivative of the function, which is [2x, 2y].
def derivative(x):
return [x[0] * 2, x[1] * 2]
Program 13.3: Derivative of the objective function
We will define the bounds of the function as a box with the range −5 to 5 in each dimension.
...
r_min, r_max = -5.0, 5.0
Program 13.4: Define range for input
The starting point of the search will be a randomly generated position in the search domain.
...
pt = r_min + rand(2) * (r_max - r_min)
Program 13.5: Define the starting point as a random sample from the domain
We can then apply the BFGS algorithm to find the minima of the objective function by specifying the name of the objective function, the initial point, the method we want to use (BFGS), and the name of the derivative function.
...
result = minimize(objective, pt, method='BFGS', jac=derivative)
Program 13.6: Perform the BFGS algorithm search
We can then review the result, reporting a message as to whether the algorithm finished successfully or not and the total number of evaluations of the objective function that were performed.
...
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
Program 13.7: Summarize the result
Finally, we can report the input variables that were found and their evaluation against the
objective function.
...
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
Program 13.8: Evaluate solution
# objective function
def objective(x):
return x[0]**2.0 + x[1]**2.0
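A minimal sketch of the complete BFGS program, assembled from the snippets above, might look like this:

from numpy.random import rand
from scipy.optimize import minimize

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# derivative of the objective function
def derivative(x):
    return [x[0] * 2, x[1] * 2]

# define the range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the BFGS algorithm search
result = minimize(objective, pt, method='BFGS', jac=derivative)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate the solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))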
Running the example applies the BFGS algorithm to our objective function and reports
the results.
Note: Your results may vary given the stochastic nature of the algorithm or
. evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that four iterations of the algorithm were performed and a solution
very close to the optima f (0.0, 0.0) = 0.0 was discovered, at least to a useful level of precision.
The minimize() function also supports the L-BFGS algorithm, which has lower memory requirements than BFGS. Specifically, it supports the L-BFGS-B version of the algorithm, where the -B suffix indicates a "boxed" version in which the bounds of the domain can be specified. This can be achieved by setting the "method" argument to "L-BFGS-B".
...
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative)
Program 13.10: Perform the L-BFGS-B algorithm search
# objective function
def objective(x):
return x[0]**2.0 + x[1]**2.0
Running the example applies the L-BFGS-B algorithm to our objective function and reports the results.
Note: Your results may vary given the stochastic nature of the algorithm or
. evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
Again, we can see that the minima to the function is found in very few evaluations.
It might be a fun exercise to increase the dimensions of the test problem to millions of
parameters and compare the memory usage and run time of the two algorithms.
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Igor Griva, Stephen G. Nash, and Ariela Sofer. Linear and Nonlinear Optimization. 2nd ed.
SIAM, 2009.
https://amzn.to/39fWKtS
Jorge Nocedal and Stephen Wright. Numerical Optimization. 2nd ed. Springer, 2006.
https://amzn.to/3sbjF2t
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
APIs
scipy.optimize.minimize API.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
Articles
Broyden-Fletcher-Goldfarb-Shanno algorithm. Wikipedia.
https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm
Limited-memory BFGS. Wikipedia.
https://en.wikipedia.org/wiki/Limited-memory_BFGS
13.6 Summary
In this tutorial, you discovered the BFGS second-order optimization algorithm. Specifically, you learned:
▷ Second-order optimization algorithms are algorithms that make use of the second-order
derivative, called the Hessian matrix for multivariate objective functions.
▷ The BFGS algorithm is perhaps the most popular second-order algorithm for numerical
optimization and belongs to a group called Quasi-Newton methods.
▷ How to minimize objective functions using the BFGS and L-BFGS-B algorithms in
Python.
Next, you will learn about curve fitting.
Least Square: Curve Fitting with
SciPy
14
Curve fitting is a type of optimization that finds an optimal set of parameters for a defined
function that best fits a given set of observations. Unlike supervised learning, curve fitting
requires that you define the function that maps examples of inputs to outputs. The mapping
function, also called the basis function, can have any form you like, including a straight line
(linear regression), a curved line (polynomial regression), and much more. This provides the
flexibility and control to define the form of the curve, where an optimization process is used to
find the specific optimal parameters of the function.
In this tutorial, you will discover how to perform curve fitting in Python. After completing
this tutorial, you will know:
▷ Curve fitting involves finding the optimal parameters to a function that maps examples
of inputs to outputs.
▷ The SciPy Python library provides an API to fit a curve to a dataset.
▷ How to use curve fitting in SciPy to fit a range of different curves to a set of observations.
Let's get started.
1
https://en.wikipedia.org/wiki/Curve_fitting
Consider that we have collected examples of data from the problem domain with inputs
and outputs. The x-axis is the independent variable or the input to the function. The y-axis is
the dependent variable or the output of the function. We don't know the form of the function
that maps examples of inputs to outputs, but we suspect that we can approximate the function
with a standard function form.
Curve fitting involves first defining the functional form of the mapping function (also called
the basis function2 or objective function), then searching for the parameters to the function that
result in the minimum error. Error is calculated by using the observations from the domain
and passing the inputs to our candidate mapping function and calculating the output, then
comparing the calculated output to the observed output. Once fit, we can use the mapping
function to interpolate or extrapolate new points in the domain. It is common to run a sequence
of input values through the mapping function to calculate a sequence of outputs, then create
a line plot of the result to show how output varies with input and how well the line fits the
observed points.
The key to curve fitting is the form of the mapping function. A straight line between inputs
and outputs can be defined as follows:
y = a × x + b
Where y is the calculated output, x is the input, and a and b are parameters of the mapping
function found using an optimization algorithm. This is called a linear equation because it is a
weighted sum of the inputs. In a linear regression model, these parameters are referred to as
coefficients; in a neural network, they are referred to as weights.
This equation can be generalized to any number of inputs, meaning that the notion of curve
fitting is not limited to two dimensions (one input and one output), but could have many input
variables. For example, a line mapping function for two input variables may look as follows:
y = a1 × x1 + a2 × x2 + b
The equation does not have to be a straight line. We can add curves in the mapping function by
adding exponents. For example, we can add a squared version of the input weighted by another
parameter:
y = a × x + b × x² + c
This is called polynomial regression3, and the squared term means it is a second-degree polynomial.
So far, linear equations of this type can be fit by minimizing least squares and can be calculated
analytically. This means we can find the optimal values of the parameters using a little linear
algebra. We might also want to add other mathematical functions to the equation, such as sine,
cosine, and more. Each term is weighted with a parameter and added to the whole to give the
output; for example:
y = a × sin(b × x) + c
Adding arbitrary mathematical functions to our mapping function generally means we cannot
calculate the parameters analytically, and instead, we will need to use an iterative optimization
algorithm. This is called nonlinear least squares4, as the objective function is no longer convex
(it's nonlinear) and not as easy to solve.
2
https://en.wikipedia.org/wiki/Basis_function
3
https://en.wikipedia.org/wiki/Polynomial_regression
Now that we are familiar with curve fitting, let's look at how we might perform curve fitting
in Python.
14.3 Curve Fitting Python API
...
x_values = ...
y_values = ...
Program 14.1: Load input variables from a file
Next, we need to design a mapping function to fit a line to the data and implement it as a
Python function that takes inputs and the arguments. It may be a straight line, in which case
it would look as follows:
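The sketch below is one possible form, with the input as the first argument and the coefficients a and b as the remaining arguments to be found by the optimization.
# a possible straight-line mapping function: the first argument is the input,
# the remaining arguments are the coefficients found by the optimization
def objective(x, a, b):
    return a * x + b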
We can then call the curve_fit() function6 to fit a straight line to the dataset using our
defined function. The function curve_fit() returns the optimal values for the mapping function,
e.g., the coefficient values. It also returns a covariance matrix for the estimated parameters, but
we can ignore that for now.
...
popt, _ = curve_fit(objective, x_values, y_values)
Program 14.3: Fit a curve
Once fit, we can use the optimal parameters and our mapping function objective() to
calculate the output for any arbitrary input. This might include the output for the examples we
have already collected from the domain, it might include new values that interpolate observed
values, or it might include extrapolated values outside of the limits of what was observed.
4
https://en.wikipedia.org/wiki/Non-linear_least_squares
5
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
6
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
...
# define new input values
x_new = ...
# unpack the optimal parameters for the objective function
a, b, c = popt
# use optimal parameters to calculate new values
y_new = objective(x_new, a, b, c)
Program 14.4: Use result of curve_fit() to extrapolate values
Now that we are familiar with using the curve fitting API, let's look at a worked example.
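As a rough sketch, loading the two variables and plotting them might look like the following; the 'data.csv' path and the column indices are placeholders standing in for the economic dataset used here, not the book's actual file.
# rough sketch of loading the data and plotting it; the 'data.csv' path and
# the column indices are placeholders, not the actual dataset file
from pandas import read_csv
from matplotlib import pyplot

# load the dataset (hypothetical local file holding the economic data)
dataframe = read_csv('data.csv', header=None)
data = dataframe.values
# choose the input (population) and output (employment) variables
x, y = data[:, 4], data[:, -1]
# plot input vs output
pyplot.scatter(x, y)
pyplot.show()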
Running the example loads the dataset, selects the variables, and creates a scatter plot.
We can see that there is a relationship between the two variables. Specifically, that as the
population increases, the total number of employees increases. It is not unreasonable to think
we can fit a line to this data.
We can use curve fitting to find the optimal values of "a" and "b" and summarize the values
that were found:
...
# curve fit
popt, _ = curve_fit(objective, x, y)
# summarize the parameter values
a, b = popt
print('y = %.5f * x + %.5f' % (a, b))
Program 14.7: Fit a curve and summarize the values found
...
pyplot.scatter(x, y)
Program 14.8: Plot input vs output
On top of the scatter plot, we can draw a line for the function with the optimized parameter
values. This involves first defining a sequence of input values between the minimum and
maximum values observed in the dataset (e.g. between about 120 and about 130).
...
x_line = arange(min(x), max(x), 1)
Program 14.9: Define a sequence of inputs between the smallest and largest known
inputs
We can then calculate the output value for each input value.
...
y_line = objective(x_line, a, b)
Program 14.10: Calculate the output for the range
Then create a line plot of the inputs vs. the outputs to see a line:
...
pyplot.plot(x_line, y_line, '--', color='red')
Program 14.11: Create a line plot for the mapping function
Tying this together, the example below uses curve fitting to find the parameters of a straight
line for our economic data.
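A sketch of such a complete example is given below; the data loading follows the same placeholder assumptions as before (a hypothetical local 'data.csv' file and illustrative column indices).
# sketch of the complete straight-line curve fitting example; the data loading
# uses the same hypothetical 'data.csv' placeholder as above
from numpy import arange
from pandas import read_csv
from scipy.optimize import curve_fit
from matplotlib import pyplot

# define the straight-line mapping function
def objective(x, a, b):
    return a * x + b

# load the input and output variables (placeholder file and columns)
dataframe = read_csv('data.csv', header=None)
data = dataframe.values
x, y = data[:, 4], data[:, -1]
# curve fit
popt, _ = curve_fit(objective, x, y)
# summarize the parameter values
a, b = popt
print('y = %.5f * x + %.5f' % (a, b))
# plot input vs output
pyplot.scatter(x, y)
# define a sequence of inputs between the smallest and largest known inputs
x_line = arange(min(x), max(x), 1)
# calculate the output for the range
y_line = objective(x_line, a, b)
# create a line plot for the mapping function
pyplot.plot(x_line, y_line, '--', color='red')
pyplot.show()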
Running the example performs curve fitting and finds the optimal parameters to our objective
function. First, the values of the parameters are reported.
y = 0.48488 * x + 8.38067
Output 14.1: Result from Program 14.12
Next, a plot is created showing the original data and the line that was fit to the data. We can
see that it is a reasonably good fit.
So far, this is not very exciting as we could achieve the same effect by fitting a linear
regression model on the dataset. Let's try a polynomial regression model by adding squared
terms to the objective function.
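One way to sketch a second-degree polynomial mapping function, matching the equation given earlier, is:
# a possible second-degree polynomial mapping function
def objective(x, a, b, c):
    return a * x + b * x**2 + c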
Next, a plot is created showing the line in the context of the observed values from the domain.
We can see that the second-degree polynomial equation that we defined is visually a better fit
for the data than the straight line that we tested first.
We could keep going and add more polynomial terms to the equation to better fit the curve.
For example, below is an example of a fifth-degree polynomial fit to the data.
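A fifth-degree mapping function might be sketched as follows, simply extending the same pattern with additional weighted powers of the input.
# a possible fifth-degree polynomial mapping function
def objective(x, a, b, c, d, e, f):
    return (a * x) + (b * x**2) + (c * x**3) + (d * x**4) + (e * x**5) + f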
Running the example fits the curve and plots the result, again capturing slightly more nuance
in how the relationship in the data changes over time.
Importantly, we are not limited to linear regression or polynomial regression. We can use
any arbitrary basis function. For example, perhaps we want a line that has wiggles to capture
the short-term movement in the observations. We could add a sine curve to the equation and find
the parameters that best integrate this element in the equation. For example, an arbitrary
function that uses a sine wave and a second-degree polynomial is listed below:
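One plausible version of this basis function is sketched below; the exact parameterization of the sine term is an assumption for illustration.
# one plausible sine-plus-polynomial mapping function; the exact form of the
# sine term is an assumption for illustration
from numpy import sin

def objective(x, a, b, c, d):
    return a * sin(b * x) + c * x**2 + d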
The complete example of fitting a curve using this basis function is listed below.
Running the example fits a curve and plots the result. We can see that adding a sine wave
has the desired effect showing a periodic wiggle with an upward trend that provides another
way of capturing the relationships in the data.
14.5 Further Reading
Books
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
https://amzn.to/36yvG9w
APIs
scipy.optimize.curve_fit API.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_
fit.html
numpy.random.randn API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.
html
Articles
Curve fitting. Wikipedia.
https://en.wikipedia.org/wiki/Curve_fitting
14.6 Summary
In this tutorial, you discovered how to perform curve fitting in Python. Specifically, you learned:
▷ Curve fitting involves finding the optimal parameters to a function that maps examples
of inputs to outputs.
▷ Unlike supervised learning, curve fitting requires that you define the function that maps
examples of inputs to outputs.
▷ How to use curve fitting in SciPy to fit a range of different curves to a set of observations.
Next, you will learn about the hill climbing algorithm.
Stochastic Hill Climbing
15
Stochastic hill climbing is an optimization algorithm. It makes use of randomness as part of the
search process. This makes the algorithm appropriate for nonlinear objective functions where
other local search algorithms do not operate well. It is also a local search algorithm, meaning
that it modifies a single solution and searches the relatively local area of the search space
until the local optima is located. This means that it is appropriate on unimodal optimization
problems or for use after the application of a global optimization algorithm.
In this tutorial, you will discover the hill climbing optimization algorithm for function
optimization. After completing this tutorial, you will know:
▷ Hill climbing is a stochastic local search algorithm for function optimization.
▷ How to implement the hill climbing algorithm from scratch in Python.
▷ How to apply the hill climbing algorithm and inspect the results of the algorithm.
Let's get started.
This means that the algorithm can skip over bumpy, noisy, discontinuous, or deceptive regions
of the response surface as part of the search.
"Stochastic hill climbing chooses at random from among the uphill moves; the probability of
selection can vary with the steepness of the uphill move."
– Page 124, Artificial Intelligence: A Modern Approach, 2009.
It is important that different points with equal evaluation are accepted as it allows the
algorithm to continue to explore the search space, such as across flat regions of the response
surface. It may also be helpful to put a limit on these so-called "sideways" moves to avoid an
infinite loop.
"If we always allow sideways moves when there are no uphill moves, an infinite loop will occur
whenever the algorithm reaches a flat local maximum that is not a shoulder. One common
solution is to put a limit on the number of consecutive sideways moves allowed. For example,
we could allow up to, say, 100 consecutive sideways moves"
– Page 123, Artificial Intelligence: A Modern Approach, 2009.
This process continues until a stop condition is met, such as a maximum number of function
evaluations or no improvement within a given number of function evaluations. The algorithm
takes its name from the fact that it will (stochastically) climb the hill of the response surface to
the local optima. This does not mean it can only be used for maximizing objective functions; it
is just a name. In fact, typically, we minimize functions instead of maximizing them.
The hill-climbing search algorithm (steepest-ascent version) [. . . ] is simply a loop
def objective(x):
    # placeholder: to return a scalar value based on x
    return 0
Next, we can generate our initial solution as a random point within the bounds of the problem,
then evaluate it using the objective function.
...
# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
# evaluate the initial point
solution_eval = objective(solution)
Program 15.2: Evaluate a random initial point
Now we can loop over a predefined number of iterations of the algorithm defined as
"n_iterations", such as 100 or 1,000.
...
for i in range(n_iterations):
...
Program 15.3: Run the hill climb
The first step of the algorithm iteration is to take a step. This requires a predefined
"step_size" parameter, which is relative to the bounds of the search space. We will take a
random step with a Gaussian distribution where the mean is our current point and the standard
deviation is defined by the "step_size". That means that about 99 percent of the steps taken
will be within (3 * step_size) of the current point.
...
candidate = solution + randn(len(bounds)) * step_size
Program 15.4: Take one random step (normal distribution)
We don't have to take steps in this way. You may wish to use a uniform distribution between 0
and the step size. For example:
...
candidate = solution + rand(len(bounds)) * step_size
Program 15.5: Take one random step (uniform distribution)
Next we need to evaluate the new candidate solution with the objective function.
...
candidate_eval = objective(candidate)
Program 15.6: Evaluate the candidate point
We then need to check if the evaluation of this new point is as good as or better than the current
best point, and if it is, replace our current best point with this new point.
...
if candidate_eval <= solution_eval:
    # store the new point
    solution, solution_eval = candidate, candidate_eval
    # report progress
    print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
Program 15.7: Check if we should keep the new point
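Tying these pieces together, a minimal sketch of a complete hillclimbing() function, built from the steps described above and assuming bounds is a NumPy array of per-variable [min, max] pairs, might look as follows.
# minimal sketch of a hill climbing function based on the pieces above;
# bounds is assumed to be a NumPy array of [min, max] pairs per variable
from numpy.random import rand, randn

def hillclimbing(objective, bounds, n_iterations, step_size):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # evaluate the initial point
    solution_eval = objective(solution)
    # run the hill climb
    for i in range(n_iterations):
        # take a step
        candidate = solution + randn(len(bounds)) * step_size
        # evaluate the candidate point
        candidate_eval = objective(candidate)
        # check if we should keep the new point
        if candidate_eval <= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]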
Now that we know how to implement the hill climbing algorithm in Python, let's look at how
we might use it to optimize an objective function.
# objective function
def objective(x):
    return x[0]**2.0
Running the example creates a line plot of the objective function and clearly marks the
function optima.
Figure 15.1: Line Plot of Objective Function With Optima Marked with a Dashed Red
Line
Next, we can apply the hill climbing algorithm to the objective function.
First, we will seed the pseudorandom number generator. This is not required in general,
but in this case, I want to ensure we get the same results (same sequence of random numbers)
each time we run the algorithm so we can plot the results later.
...
seed(5)
Program 15.10: Seed the pseudorandom number generator
Next, we can define the configuration of the search. In this case, we will search for 1,000
iterations of the algorithm and use a step size of 0.1. Given that we are using a Gaussian
function for generating the step, this means that about 99 percent of all steps taken will be
within a distance of 0.1 × 3 of a given point, i.e. three standard deviations.
...
n_iterations = 1000
# define the maximum step size
step_size = 0.1
Program 15.11: Define the number of iterations and step size
...
best, score = hillclimbing(objective, bounds, n_iterations, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))
Program 15.12: Perform the hill climbing search
# objective function
def objective(x):
    return x[0]**2.0
# report progress
print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
return [solution, solution_eval]
Running the example reports the progress of the search, including the iteration number, the
input to the function, and the response from the objective function each time an improvement
was detected. At the end of the search, the best solution is found and its evaluation is reported.
In this case we can see about 36 improvements over the 1,000 iterations of the algorithm and a
solution that is very close to the optimal input of 0.0 that evaluates to f (0.0) = 0.0.
It can be interesting to review the progress of the search as a line plot that shows the
change in the evaluation of the best solution each time there is an improvement. We can update
the hillclimbing() to keep track of the objective function evaluations each time there is an
improvement and return this list of scores.
We can then create a line plot of these scores to see the relative change in objective function for
each improvement found during the search.
...
pyplot.plot(scores, '.-')
pyplot.xlabel('Improvement Number')
pyplot.ylabel('Evaluation f(x)')
pyplot.show()
Program 15.15: Line plot of best scores
Tying this together, the complete example of performing the search and plotting the objective
function scores of the improved solutions during the search is listed below.
# objective function
def objective(x):
    return x[0]**2.0
pyplot.ylabel('Evaluation f(x)')
pyplot.show()
Program 15.16: Hill climbing search of a one-dimensional objective function
Running the example performs the search and reports the results as before. A line plot is
created showing the objective function evaluation for each improvement during the hill climbing
search. We can see about 36 changes to the objective function evaluation during the search,
with large changes initially and very small to imperceptible changes towards the end of the
search as the algorithm converged on the optima.
Figure 15.2: Line Plot of Objective Function Evaluation for Each Improvement During
the Hill Climbing Search
We can then create a plot of the response surface of the objective function and mark the optima
as before.
...
# sample input range uniformly at 0.1 increments
inputs = arange(bounds[0,0], bounds[0,1], 0.1)
# create a line plot of input vs result
pyplot.plot(inputs, [objective([x]) for x in inputs], '--')
# draw a vertical line at the optimal input
pyplot.axvline(x=[0.0], ls='--', color='red')
Program 15.18: Create plot of response surface
Finally, we can plot the sequence of candidate solutions found by the search as black dots.
...
pyplot.plot(solutions, [objective(x) for x in solutions], 'o', color='black')
Program 15.19: Plot the sample as black circles
Tying this together, the complete example of plotting the sequence of improved solutions on the
response surface of the objective function is listed below.
# objective function
def objective(x):
    return x[0]**2.0
Running the example performs the hill climbing search and reports the results as before. A
plot of the response surface is created as before showing the familiar bowl shape of the function
with a vertical red line marking the optima of the function. The sequence of best solutions
found during the search is shown as black dots running down the bowl shape to the optima.
Figure 15.3: Response Surface of Objective Function With Sequence of Best Solutions
Plotted as Black Dots
15.5 Further Reading
Books
Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd ed. Pearson,
2009.
https://amzn.to/2HYk1Xj
APIs
numpy.random.rand API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.
html
numpy.random.randn API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.
html
numpy.random.seed API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.
html
Articles
Hill climbing. Wikipedia.
https://en.wikipedia.org/wiki/Hill_climbing
15.6 Summary
In this tutorial, you discovered the hill climbing optimization algorithm for function optimization.
Specifically, you learned:
▷ Hill climbing is a stochastic local search algorithm for function optimization.
▷ How to implement the hill climbing algorithm from scratch in Python.
▷ How to apply the hill climbing algorithm and inspect the results of the algorithm.
Next, you will learn about an algorithm that repeats hill climbing with different starting
points.
Iterated Local Search
16
Iterated Local Search is a stochastic global optimization algorithm. It involves the repeated
application of a local search algorithm to modified versions of a good solution found previously.
In this way, it is like a clever version of the stochastic hill climbing with random restarts
algorithm. The intuition behind the algorithm is that random restarts can help to locate many
local optima in a problem and that better local optima are often close to other local optima.
Therefore, modest perturbations to existing local optima may locate better or even the best solutions
to an optimization problem.
In this tutorial, you will discover how to implement the iterated local search algorithm
from scratch. After completing this tutorial, you will know:
▷ Iterated local search is a stochastic global search optimization algorithm that is a
smarter version of stochastic hill climbing with random restarts.
▷ How to implement stochastic hill climbing with random restarts from scratch.
▷ How to implement and apply the iterated local search algorithm to a nonlinear objective
function.
Let's get started.
"... you're presently in, and walking from local optimum to local optimum in this way often
outperforms just trying new locations entirely at random."
– Page 26, Essentials of Metaheuristics, 2011.
This allows the search to be performed at two levels. The hill climbing algorithm is the
local search for getting the most out of a specific candidate solution or region of the search
space, and the restart approach allows different regions of the search space to be explored.
In this way, the algorithm Iterated Local Search explores multiple local optima in the search
space, increasing the likelihood of locating the global optima. The Iterated Local Search was
proposed for combinatorial optimization problems, such as the traveling salesman problem
(TSP), although it can be applied to continuous function optimization by using different step
sizes in the search space: smaller steps for the hill climbing and larger steps for the random
restart.
Now that we are familiar with the Iterated Local Search algorithm, letŠs explore how to
implement the algorithm from scratch.
1
https://en.wikipedia.org/wiki/Iterated_local_search
2
https://en.wikipedia.org/wiki/Hill_climbing
16.3 Ackley Objective Function
# objective function
def objective(x, y):
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20
Running the example creates the surface plot of the Ackley function showing the vast number
of local optima.
3
https://en.wikipedia.org/wiki/Ackley_function
We will use this as the basis for implementing and comparing a simple stochastic hill
climbing algorithm, stochastic hill climbing with random restarts, and Ąnally iterated local
search. We would expect a stochastic hill climbing algorithm to get stuck easily in local minima.
We would expect stochastic hill climbing with restarts to find many local minima, and we would
expect iterated local search to perform better than either method on this problem if configured
appropriately.
...
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
Program 16.2: Generate a random point in the search space
We can generate perturbed versions of a currently working solution using a Gaussian probability
distribution with the mean of the current values in the solution and a standard deviation
controlled by a hyperparameter that controls how far the search is allowed to explore from the
current working solution. We will refer to this hyperparameter as "step_size", for example:
...
candidate = solution + randn(len(bounds)) * step_size
Program 16.3: Generate a perturbed version of a current working solution
Importantly, we must check that generated solutions are within the search space. This can be
achieved with a custom function named in_bounds() that takes a candidate solution and the
bounds of the search space and returns True if the point is in the search space, False otherwise.
This function can then be called during the hill climb to confirm that new points are in the
bounds of the search space, and if not, new points can be generated.
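A minimal sketch of such an in_bounds() helper might look as follows, assuming bounds is a NumPy array of [min, max] pairs per input variable.
# minimal sketch of an in_bounds() helper: returns True only if every
# dimension of the point lies within the corresponding [min, max] bounds
def in_bounds(point, bounds):
    # enumerate all dimensions of the point
    for d in range(len(bounds)):
        # check if out of bounds for this dimension
        if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]:
            return False
    return True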
Tying this together, the function hillclimbing() below implements the stochastic hill
climbing local search algorithm. It takes the name of the objective function, bounds of the
problem, number of iterations, and step size as arguments and returns the best solution and its
evaluation.
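A sketch of such a function is given below; compared with the earlier one-dimensional version, it accepts an optional starting point and re-samples candidates with the in_bounds() helper above (the default-argument handling of start_pt is an assumption for illustration).
# sketch of a hill climbing function that accepts an optional start_pt and
# re-samples candidates until they fall inside the bounds; assumes the
# in_bounds() helper defined above is available
from numpy.random import rand, randn

def hillclimbing(objective, bounds, n_iterations, step_size, start_pt=None):
    # generate an initial point if one was not provided
    if start_pt is None:
        start_pt = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    solution = start_pt
    # evaluate the initial point
    solution_eval = objective(solution)
    # run the hill climb
    for i in range(n_iterations):
        # take a step, re-sampling until the candidate is within the bounds
        candidate = None
        while candidate is None or not in_bounds(candidate, bounds):
            candidate = solution + randn(len(bounds)) * step_size
        # evaluate the candidate point
        candidate_eval = objective(candidate)
        # check if we should keep the new point
        if candidate_eval <= solution_eval:
            solution, solution_eval = candidate, candidate_eval
    return [solution, solution_eval]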
We can test this algorithm on the Ackley function. We will fix the seed for the pseudorandom
number generator to ensure we get the same results each time the code is run. The algorithm
will be run for 1,000 iterations and a step size of 0.05 units will be used; both hyperparameters
were chosen after a little trial and error. At the end of the run, we will report the best solution
found.
...
# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iterations = 1000
# define the maximum step size
step_size = 0.05
# perform the hill climbing search
best, score = hillclimbing(objective, bounds, n_iterations, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))
Program 16.6: Hill climb and report the best solution found
Tying this together, the complete example of applying the stochastic hill climbing algorithm to
the Ackley objective function is listed below.
# objective function
def objective(v):
    x, y = v
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20
Running the example performs the stochastic hill climbing search on the objective function.
Each improvement found during the search is reported and the best solution is then reported at
the end of the search.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see about 13 improvements during the search and a final solution of about
f (−0.981, 1.965), resulting in an evaluation of about 5.381, which is far from f (0.0, 0.0) = 0.
Next, we will modify the algorithm to perform random restarts and see if we can achieve better
results.
Next, we can implement the random restart algorithm by repeatedly calling the hillclimbing()
function a fixed number of times. Each call, we will generate a new randomly selected starting
point for the hill climbing search.
...
# generate a random initial point for the search
start_pt = None
while start_pt is None or not in_bounds(start_pt, bounds):
    start_pt = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
# perform a stochastic hill climbing search
solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt)
Program 16.9: Stochastic hill climbing with random initial point
We can then inspect the result and keep it if it is better than any result of the search we have
seen so far.
...
if solution_eval < best_eval:
    best, best_eval = solution, solution_eval
    print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval))
Program 16.10: Check for new best
Tying this together, the random_restarts() function implements the stochastic hill climbing
algorithm with random restarts.
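A sketch of such a random_restarts() function, built on the hillclimbing() and in_bounds() helpers described above, might look as follows.
# sketch of hill climbing with random restarts; assumes the hillclimbing()
# and in_bounds() helpers described above are available
from numpy.random import rand

def random_restarts(objective, bounds, n_iter, step_size, n_restarts):
    best, best_eval = None, 1e+10
    # enumerate restarts
    for n in range(n_restarts):
        # generate a random initial point for the search
        start_pt = None
        while start_pt is None or not in_bounds(start_pt, bounds):
            start_pt = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        # perform a stochastic hill climbing search
        solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt)
        # check for new best
        if solution_eval < best_eval:
            best, best_eval = solution, solution_eval
            print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval))
    return [best, best_eval]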
We can then apply this algorithm to the Ackley objective function. In this case, we will limit
the number of random restarts to 30, chosen arbitrarily.
The complete example is listed below.
# objective function
def objective(v):
    x, y = v
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

        start_pt = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        # perform a stochastic hill climbing search
        solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt)
        # check for new best
        if solution_eval < best_eval:
            best, best_eval = solution, solution_eval
            print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval))
    return [best, best_eval]
Running the example will perform a stochastic hill climbing with random restarts search
for the Ackley objective function. Each time an improved overall solution is discovered, it is
reported and the final best solution found by the search is summarized.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see three improvements during the search and that the best solution
found was approximately f (0.002, 0.002), which evaluated to about 0.009, which is much better
than a single run of the hill climbing algorithm.
Next, letŠs look at how we can implement the iterated local search algorithm.
16.6 Iterated Local Search Algorithm
...
start_pt = None
while start_pt is None or not in_bounds(start_pt, bounds):
    start_pt = best + randn(len(bounds)) * p_size
Program 16.13: Generate an initial point as a perturbed version of the last best
We can then apply the algorithm to the Ackley objective function. In this case, we will use
a larger step size value of 1.0 for the random restarts, chosen after a little trial and error.
The complete example is listed below.
# objective function
def objective(v):
    x, y = v
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

    for n in range(n_restarts):
        # generate an initial point as a perturbed version of the last best
        start_pt = None
        while start_pt is None or not in_bounds(start_pt, bounds):
            start_pt = best + randn(len(bounds)) * p_size
        # perform a stochastic hill climbing search
        solution, solution_eval = hillclimbing(objective, bounds, n_iter, step_size, start_pt)
        # check for new best
        if solution_eval < best_eval:
            best, best_eval = solution, solution_eval
            print('Restart %d, best: f(%s) = %.5f' % (n, best, best_eval))
    return [best, best_eval]
Running the example will perform an Iterated Local Search of the Ackley objective function.
Each time an improved overall solution is discovered, it is reported and the final best solution
found by the search is summarized at the end of the run.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see four improvements during the search and that the best solution
found was two very small inputs that are close to zero, which evaluated to about 0.0003, which
is better than either a single run of the hill climber or the hill climber with restarts.
Done!
f([ 1.16431936e-04 -3.31358206e-06]) = 0.000330
Output 16.3: Result from Program 16.15
16.7 Further Reading
Books
Sean Luke. Essentials of Metaheuristics. lulu.com, 2011.
https://amzn.to/3lHryZr
Michel Gendreau and Jean-Yves Potvin. Handbook of Metaheuristics. 3rd ed. Springer, 2019.
https://amzn.to/2IIq0Qt
Articles
Hill climbing. Wikipedia.
https://en.wikipedia.org/wiki/Hill_climbing
Iterated local search. Wikipedia.
https://en.wikipedia.org/wiki/Iterated_local_search
16.8 Summary
In this tutorial, you discovered how to implement the iterated local search algorithm from
scratch. Specifically, you learned:
▷ Iterated local search is a stochastic global search optimization algorithm that is a
smarter version of stochastic hill climbing with random restarts.
▷ How to implement stochastic hill climbing with random restarts from scratch.
▷ How to implement and apply the iterated local search algorithm to a nonlinear objective
function.
Next, you will depart from local optimization and learn about global optimization, starting
with the genetic algorithm.
IV
Global Optimization
Simple Genetic Algorithm from
Scratch
17
The genetic algorithm is a stochastic global optimization algorithm. It may be one of the most
popular and widely known biologically inspired algorithms, along with artificial neural networks.
The algorithm is a type of evolutionary algorithm and performs an optimization procedure
inspired by the biological theory of evolution by means of natural selection with a binary
representation and simple operators based on genetic recombination and genetic mutations.
In this tutorial, you will discover the genetic algorithm optimization algorithm. After
completing this tutorial, you will know:
▷ Genetic algorithm is a stochastic optimization algorithm inspired by evolution.
▷ How to implement the genetic algorithm from scratch in Python.
▷ How to apply the genetic algorithm to a continuous objective function.
Let's get started.
1
https://en.wikipedia.org/wiki/Genetic_algorithm
"... where fitter individuals are more likely to pass on their genes to the next generation."
– Page 148, Algorithms for Optimization, 2019.
The algorithm uses analogs of a genetic representation (bitstrings), fitness (function
evaluations), genetic recombination (crossover of bitstrings), and mutation (flipping bits).
The algorithm works by first creating a population of a fixed size of random bitstrings. The
main loop of the algorithm is repeated for a fixed number of iterations or until no further
improvement is seen in the best solution over a given number of iterations.
One iteration of the algorithm is like an evolutionary generation. First, the population
of bitstrings (candidate solutions) are evaluated using the objective function. The objective
function evaluation for each candidate solution is taken as the fitness of the solution, which
may be minimized or maximized. Then, parents are selected based on their fitness. A given
candidate solution may be used as parent zero or more times. A simple and effective approach
to selection involves drawing k candidates from the population randomly and selecting the
member from the group with the best fitness. This is called tournament selection where k is a
hyperparameter and set to a value such as 3. This simple approach simulates a more costly
fitness-proportionate selection scheme.
Ątness-proportionate selection scheme.
In tournament selection, each parent is the Ąttest out of k randomly chosen
“ and matching parts of two parents to form children. How you do that mixing and
matching depends on the representation of the individuals.
Ů Page 36, Essentials of Metaheuristics, 2011. ”
Mutation involves flipping bits in created children candidate solutions. Typically, the
mutation rate is set to 1/L, where L is the length of the bitstring.
"Each bit in a binary-valued chromosome typically has a small probability of being flipped.
For a chromosome with m bits, this mutation rate is typically set to 1/m, yielding an average
of one mutation per child chromosome."
– Page 155, Algorithms for Optimization, 2019.
For example, if a problem used a bitstring with 20 bits, then a good default mutation rate
would be (1/20) = 0.05 or a probability of 5 percent. This defines the simple genetic algorithm
procedure. It is a large field of study, and there are many extensions to the algorithm.
Now that we are familiar with the simple genetic algorithm procedure, let's look at how we
might implement it from scratch.
...
pop = [randint(0, 2, n_bits).tolist() for _ in range(n_pop)]
Program 17.1: Initial population of random bitstring
Next, we can enumerate over a fixed number of algorithm iterations, in this case, controlled by
a hyperparameter named "n_iter".
...
for gen in range(n_iter):
...
Program 17.2: Enumerate generations
The first step in the algorithm iteration is to evaluate all candidate solutions. We will use a
function named objective() as a generic objective function and call it to get a fitness score,
which we will minimize.
...
scores = [objective(c) for c in pop]
Program 17.3: Evaluate all candidates in the population
2
https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html
We can then select parents that will be used to create children. The tournament selection
procedure can be implemented as a function that takes the population and returns one selected
parent. The k value is fixed at 3 with a default argument, but you can experiment with different
values if you like.
We can then call this function one time for each position in the population to create a list of
parents.
...
selected = [selection(pop, scores) for _ in range(n_pop)]
Program 17.5: Select parents
We can then create the next generation. This first requires a function to perform crossover. This
function will take two parents and the crossover rate. The crossover rate is a hyperparameter
that determines whether crossover is performed or not, and if not, the parents are copied into
the next generation. It is a probability and typically has a large value close to 1.0.
The crossover() function below implements crossover using a draw of a random number
in the range [0,1] to determine if crossover is performed, then selecting a valid split point if
crossover is to be performed.
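A minimal sketch of such a one-point crossover operator for bitstrings might look as follows.
# minimal sketch of a one-point crossover operator for bitstrings
from numpy.random import rand, randint

def crossover(p1, p2, r_cross):
    # children are copies of the parents by default
    c1, c2 = p1.copy(), p2.copy()
    # check for recombination
    if rand() < r_cross:
        # select a crossover point that is not on the end of the string
        pt = randint(1, len(p1)-2)
        # perform crossover by swapping the tails of the parents
        c1 = p1[:pt] + p2[pt:]
        c2 = p2[:pt] + p1[pt:]
    return [c1, c2]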
We also need a function to perform mutation. This procedure simply flips bits with a low
probability controlled by the "r_mut" hyperparameter.
We can then loop over the list of parents and create a list of children to be used as the next
generation, calling the crossover and mutation functions as needed.
...
children = list()
for i in range(0, n_pop, 2):
    # get selected parents in pairs
    p1, p2 = selected[i], selected[i+1]
    # crossover and mutation
    for c in crossover(p1, p2, r_cross):
        # mutation
        mutation(c, r_mut)
        # store for next generation
        children.append(c)
Program 17.8: Create the next generation
We can tie all of this together into a function named genetic_algorithm() that takes the name
of the objective function and the hyperparameters of the search, and returns the best solution
found during the search.
# mutation
mutation(c, r_mut)
# store for next generation
children.append(c)
# replace population
pop = children
return [best, best_eval]
Program 17.9: Genetic algorithm
Now that we have developed an implementation of the genetic algorithm, letŠs explore how we
might apply it to an objective function.
def onemax(x):
    return -sum(x)
Program 17.10: Objective function
Next, we can configure the search. The search will run for 100 iterations and we will use 20 bits
in our candidate solutions, meaning the optimal fitness will be −20.0. The population size will
be 100, and we will use a crossover rate of 90 percent and a mutation rate of 5 percent. This
configuration was chosen after a little trial and error.
...
# define the total iterations
n_iter = 100
# bits
n_bits = 20
# define the population size
n_pop = 100
# crossover rate
r_cross = 0.9
# mutation rate
r_mut = 1.0 / float(n_bits)
Program 17.11: DeĄne the hyperparameters
The search can then be called and the best result reported.
...
best, score = genetic_algorithm(onemax, n_bits, n_iter, n_pop, r_cross, r_mut)
print('Done!')
print('f(%s) = %f' % (best, score))
Program 17.12: Perform the genetic algorithm search
Tying this together, the complete example of applying the genetic algorithm to the OneMax
objective function is listed below.
# objective function
def onemax(x):
    return -sum(x)

# tournament selection
def selection(pop, scores, k=3):
    # first random selection
    selection_ix = randint(len(pop))
    for ix in randint(0, len(pop), k-1):
        # check if better (e.g. perform a tournament)
        if scores[ix] < scores[selection_ix]:
            selection_ix = ix
    return pop[selection_ix]

# mutation operator
def mutation(bitstring, r_mut):
    for i in range(len(bitstring)):
        # check for a mutation
        if rand() < r_mut:
            # flip the bit
            bitstring[i] = 1 - bitstring[i]

# genetic algorithm
def genetic_algorithm(objective, n_bits, n_iter, n_pop, r_cross, r_mut):
    # initial population of random bitstring
    pop = [randint(0, 2, n_bits).tolist() for _ in range(n_pop)]
    # keep track of best solution
    best, best_eval = 0, objective(pop[0])
    # enumerate generations
    for gen in range(n_iter):
        # evaluate all candidates in the population
        scores = [objective(c) for c in pop]
        # check for new best solution
        for i in range(n_pop):
            if scores[i] < best_eval:
                best, best_eval = pop[i], scores[i]
                print(">%d, new best f(%s) = %.3f" % (gen, pop[i], scores[i]))
        # select parents
        selected = [selection(pop, scores) for _ in range(n_pop)]
        # create the next generation
        children = list()
        for i in range(0, n_pop, 2):
            # get selected parents in pairs
            p1, p2 = selected[i], selected[i+1]
            # crossover and mutation
            for c in crossover(p1, p2, r_cross):
                # mutation
                mutation(c, r_mut)
                # store for next generation
                children.append(c)
        # replace population
        pop = children
    return [best, best_eval]
Running the example will report the best result as it is found along the way, then the final
best solution at the end of the search, which we would expect to be the optimal solution.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the search found the optimal solution after about eight
generations.
17.5 Genetic Algorithm for Continuous Function Optimization
def objective(x):
    return x[0]**2.0 + x[1]**2.0
Program 17.14: Objective function
We can minimize this function with a genetic algorithm. First, we must define the bounds of
each input variable.
...
bounds = [[-5.0, 5.0], [-5.0, 5.0]]
Program 17.15: Define range for input
We will take the "n_bits" hyperparameter as a number of bits per input variable to the objective
function and set it to 16 bits.
...
n_bits = 16
Program 17.16: Define bits per variable
This means our actual bit string will have (16 × 2) = 32 bits, given the two input variables. We
must update our mutation rate accordingly.
...
r_mut = 1.0 / (float(n_bits) * len(bounds))
Program 17.17: Define mutation rate
Next, we need to ensure that the initial population creates random bitstrings that are large
enough.
...
pop = [randint(0, 2, n_bits*len(bounds)).tolist() for _ in range(n_pop)]
Program 17.18: Initial population of random bitstring
Finally, we need to decode the bitstrings to numbers prior to evaluating each with the objective
function. We can achieve this by first decoding each substring to an integer, then scaling the
integer to the desired range. This will give a vector of values in the range that can then be
provided to the objective function for evaluation. The decode() function below implements
this, taking the bounds of the function, the number of bits per variable, and a bitstring as input
and returns a list of decoded real values.
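A sketch of such a decode() function might look as follows.
# sketch of decoding a concatenated bitstring into real values, one value per
# [min, max] pair in bounds, using n_bits bits per variable
def decode(bounds, n_bits, bitstring):
    decoded = list()
    largest = 2**n_bits
    for i in range(len(bounds)):
        # extract the substring for this variable
        start, end = i * n_bits, (i * n_bits) + n_bits
        substring = bitstring[start:end]
        # convert the substring of bits to a string of characters
        chars = ''.join([str(s) for s in substring])
        # convert the string of characters to an integer
        integer = int(chars, 2)
        # scale the integer to the desired range
        value = bounds[i][0] + (integer / largest) * (bounds[i][1] - bounds[i][0])
        # store the decoded value
        decoded.append(value)
    return decoded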
We can then call this at the beginning of the algorithm loop to decode the population, then
evaluate the decoded version of the population.
...
# decode population
decoded = [decode(bounds, n_bits, p) for p in pop]
# evaluate all candidates in the population
scores = [objective(d) for d in decoded]
Program 17.20: Evaluate candidates in the population
Tying this together, the complete example of the genetic algorithm for continuous function
optimization is listed below.
# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# tournament selection
def selection(pop, scores, k=3):
    # first random selection
    selection_ix = randint(len(pop))
    for ix in randint(0, len(pop), k-1):
        # check if better (e.g. perform a tournament)
        if scores[ix] < scores[selection_ix]:
            selection_ix = ix
    return pop[selection_ix]

# mutation operator
def mutation(bitstring, r_mut):
    for i in range(len(bitstring)):
        # check for a mutation
        if rand() < r_mut:
            # flip the bit
            bitstring[i] = 1 - bitstring[i]

# genetic algorithm
def genetic_algorithm(objective, bounds, n_bits, n_iter, n_pop, r_cross, r_mut):
    # initial population of random bitstring
    pop = [randint(0, 2, n_bits*len(bounds)).tolist() for _ in range(n_pop)]
    # keep track of best solution
    best, best_eval = 0, objective(decode(bounds, n_bits, pop[0]))
    # enumerate generations
    for gen in range(n_iter):
        # decode population
        decoded = [decode(bounds, n_bits, p) for p in pop]
        # evaluate all candidates in the population
        scores = [objective(d) for d in decoded]
        # check for new best solution
        for i in range(n_pop):
            if scores[i] < best_eval:
                best, best_eval = pop[i], scores[i]
                print(">%d, new best f(%s) = %f" % (gen, decoded[i], scores[i]))
        # select parents
        selected = [selection(pop, scores) for _ in range(n_pop)]
        # create the next generation
        children = list()
        for i in range(0, n_pop, 2):
            # get selected parents in pairs
            p1, p2 = selected[i], selected[i+1]
            # crossover and mutation
            for c in crossover(p1, p2, r_cross):
                # mutation
                mutation(c, r_mut)
                # store for next generation
                children.append(c)
        # replace population
        pop = children
    return [best, best_eval]
Running the example reports the best decoded results along the way and the best decoded
solution at the end of the run.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the algorithm discovers an input very close to f (0.0, 0.0) = 0.0.
17.6 Further Reading
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Sean Luke. Essentials of Metaheuristics. lulu.com, 2011.
https://amzn.to/3lHryZr
David E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. 13th
ed. Addison-Wesley, 1989.
https://amzn.to/3jADHgZ
Melanie Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1998.
https://amzn.to/3kK8Osd
Andries P. Engelbrecht. Computational Intelligence: An Introduction. 2nd ed. Wiley, 2007.
https://amzn.to/3ob61KA
APIs
numpy.random.randint API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.
html
Articles
Genetic algorithm. Wikipedia.
https://en.wikipedia.org/wiki/Genetic_algorithm
Genetic algorithms. Scholarpedia.
http://www.scholarpedia.org/article/Genetic_algorithms
17.7 Summary
In this tutorial, you discovered the genetic algorithm for optimization. Specifically, you learned:
▷ Genetic algorithm is a stochastic optimization algorithm inspired by evolution.
▷ How to implement the genetic algorithm from scratch in Python.
▷ How to apply the genetic algorithm to a continuous objective function.
Next, you will learn about evolution strategies.
Evolution Strategies
18
Evolution strategies is a stochastic global optimization algorithm. It is an evolutionary algorithm
related to others, such as the genetic algorithm, although it is designed specifically for continuous
function optimization.
In this tutorial, you will discover how to implement the evolution strategies optimization
algorithm. After completing this tutorial, you will know:
▷ Evolution Strategies is a stochastic global optimization algorithm inspired by the
biological theory of evolution by natural selection.
▷ There is a standard terminology for Evolution Strategies and two common versions of
the algorithm referred to as (µ, λ)-ES and (µ + λ)-ES.
▷ How to implement and apply the Evolution Strategies algorithm to continuous objective
functions.
Let's get started.
1
https://en.wikipedia.org/wiki/Evolution_strategy
18.3 Develop a (µ, λ)-ES
The Ackley function2 is an example of a multimodal objective function that has a single
global optima and multiple local optima in which a local search might get stuck. As such, a
global optimization technique is required. It is a two-dimensional objective function that has a
global optima at [0,0], which evaluates to 0.0. The example below implements the Ackley and
creates a three-dimensional surface plot showing the global optima and multiple local optima.
# objective function
def objective(x, y):
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20
Running the example creates the surface plot of the Ackley function showing the vast number
of local optima.
2
https://en.wikipedia.org/wiki/Ackley_function
We can then use this function when generating the initial population of λ (i.e., "lam") random
candidate solutions. For example:
...
population = list()
for _ in range(lam):
    candidate = None
    while candidate is None or not in_bounds(candidate, bounds):
        candidate = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    population.append(candidate)
Program 18.3: Initial population
Next, we can iterate over a fixed number of iterations of the algorithm. Each iteration first
involves evaluating each candidate solution in the population. We will calculate the scores and
store them in a separate parallel list.
...
scores = [objective(c) for c in population]
Program 18.4: Evaluate fitness for the population
Next, we need to select the µ (i.e., "mu") parents with the best scores, lowest scores in this case,
as we are minimizing the objective function. We will do this in two steps. First, we will rank
the candidate solutions based on their scores in ascending order so that the solution with the
lowest score has a rank of 0, the next has a rank 1, and so on. We can use a double call of the
argsort() function3 to achieve this. We will then use the ranks and select those parents that
have a rank below the value "mu". This means if mu is set to 5 to select 5 parents, only those
parents with a rank between 0 and 4 will be selected.
...
# rank scores in ascending order
ranks = argsort(argsort(scores))
# select the indexes for the top mu ranked solutions
selected = [i for i,_ in enumerate(ranks) if ranks[i] < mu]
Program 18.5: Select parents by ranking
We can then create children for each selected parent. First, we must calculate the total number
of children to create per parent.
...
n_children = int(lam / mu)
Program 18.6: Calculate the number of children per parent
We can then iterate over each parent and create modified versions of each. We will create
children using a similar technique used in stochastic hill climbing. Specifically, each variable will
be sampled using a Gaussian distribution with the current value as the mean and the standard
deviation provided as a "step_size" hyperparameter.
...
for _ in range(n_children):
    child = None
    while child is None or not in_bounds(child, bounds):
        child = population[i] + randn(len(bounds)) * step_size
Program 18.7: Create children for parent
We can also check if each selected parent is better than the best solution seen so far so that we
can return the best solution at the end of the search.
3
https://numpy.org/doc/stable/reference/generated/numpy.argsort.html
...
if scores[i] < best_eval:
    best, best_eval = population[i], scores[i]
    print('%d, Best: f(%s) = %.5f' % (epoch, best, best_eval))
Program 18.8: Check if this parent is the best solution ever seen
The created children can be added to a list and we can replace the population with the list of
children at the end of the algorithm iteration.
...
population = children
Program 18.9: Replace population with children
We can tie all of this together into a function named es_comma() that performs the comma
version of the Evolution Strategy algorithm. The function takes the name of the objective
function, the bounds of the search space, the number of iterations, the step size, and the mu
and lambda hyperparameters and returns the best solution found during the search and its
evaluation.
children.append(child)
# replace population with children
population = children
return [best, best_eval]
Program 18.10: Evolution strategy (µ, λ) algorithm
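For reference, a minimal sketch of such an es_comma() function, built from the steps above and assuming an in_bounds() helper that checks a point against the bounds, might look as follows.
# minimal sketch of a (mu, lambda) evolution strategy based on the steps above;
# assumes an in_bounds() helper that checks a point against the bounds
from numpy import argsort
from numpy.random import rand, randn

def es_comma(objective, bounds, n_iter, step_size, mu, lam):
    best, best_eval = None, 1e+10
    # calculate the number of children per parent
    n_children = int(lam / mu)
    # initial population
    population = list()
    for _ in range(lam):
        candidate = None
        while candidate is None or not in_bounds(candidate, bounds):
            candidate = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
        population.append(candidate)
    # perform the search
    for epoch in range(n_iter):
        # evaluate fitness for the population
        scores = [objective(c) for c in population]
        # rank scores in ascending order
        ranks = argsort(argsort(scores))
        # select the indexes for the top mu ranked solutions
        selected = [i for i, _ in enumerate(ranks) if ranks[i] < mu]
        # create children from parents
        children = list()
        for i in selected:
            # check if this parent is the best solution ever seen
            if scores[i] < best_eval:
                best, best_eval = population[i], scores[i]
                print('%d, Best: f(%s) = %.5f' % (epoch, best, best_eval))
            # create children for parent
            for _ in range(n_children):
                child = None
                while child is None or not in_bounds(child, bounds):
                    child = population[i] + randn(len(bounds)) * step_size
                children.append(child)
        # replace population with children
        population = children
    return [best, best_eval]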
Next, we can apply this algorithm to our Ackley objective function. We will run the algorithm
for 5,000 iterations and use a step size of 0.15 in the search space. We will use a population size
(λ) of 100 and select 20 parents (µ). These hyperparameters were chosen after a little trial and error.
At the end of the search, we will report the best candidate solution found during the search.
...
# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iter = 5000
# define the maximum step size
step_size = 0.15
# number of parents selected
mu = 20
# the number of children generated by parents
lam = 100
# perform the evolution strategy (mu, lambda) search
best, score = es_comma(objective, bounds, n_iter, step_size, mu, lam)
print('Done!')
print('f(%s) = %f' % (best, score))
Program 18.11: Search and report the best candidate solution
Tying this together, the complete example of applying the comma version of the Evolution
Strategies algorithm to the Ackley objective function is listed below.
# objective function
def objective(v):
    x, y = v
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20
Running the example reports the candidate solution and scores each time a better solution is
found, then reports the best solution found at the end of the search.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that about 22 improvements to performance were seen during the
search and the best solution is close to the optima. No doubt, this solution can be provided as
a starting point to a local search algorithm to be further refined, a common practice when using
a global optimization algorithm like ES.
Now that we are familiar with how to implement the comma version of evolution strategies,
let's look at how we might implement the plus version.
18.4 Develop a (µ + λ)-ES
...
children.append(population[i])
Program 18.13: Keep the parent
The updated version of the function with this addition, and with a new name es_plus(), is
listed below.
child = None
while child is None or not in_bounds(child, bounds):
child = population[i] + randn(len(bounds)) * step_size
children.append(child)
# replace population with children
population = children
return [best, best_eval]
Program 18.14: Evolution strategy (µ + λ) algorithm
We can apply this version of the algorithm to the Ackley objective function with the same
hyperparameters used in the previous section.
The complete example is listed below.
# objective function
def objective(v):
    x, y = v
    return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20
Running the example reports the candidate solution and scores each time a better solution is
found, then reports the best solution found at the end of the search.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that about 24 improvements to performance were seen during the
search. We can also see that a better final solution was found with an evaluation of 0.000532,
compared to 0.001147 found with the comma version on this objective function.
18.5 Further Reading
Papers
Hans-Georg Beyer and Hans-Paul Schwefel. "Evolution Strategies - A Comprehensive Introduction". Natural Computing, 1, 2002, pp. 3-52.
https://link.springer.com/article/10.1023/A:1015059928466
Books
Sean Luke. Essentials of Metaheuristics. lulu.com, 2011.
https://amzn.to/3lHryZr
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Articles
Evolution strategy. Wikipedia.
https://en.wikipedia.org/wiki/Evolution_strategy
Evolution strategies. Scholarpedia.
http://www.scholarpedia.org/article/Evolution_strategies
18.6 Summary
In this tutorial, you discovered how to implement the evolution strategies optimization algorithm.
Specifically, you learned:
▷ Evolution Strategies is a stochastic global optimization algorithm inspired by the
biological theory of evolution by natural selection.
▷ There is a standard terminology for Evolution Strategies and two common versions of
the algorithm referred to as (µ, λ)-ES and (µ + λ)-ES.
▷ How to implement and apply the Evolution Strategies algorithm to continuous objective
functions.
Next, you will learn about differential evolution.
Differential Evolution
19
Differential evolution is a heuristic approach for the global optimization of nonlinear and
non-differentiable continuous space functions. The differential evolution algorithm belongs
to a broader family of evolutionary computing algorithms. Similar to other popular direct
search approaches, such as genetic algorithms and evolution strategies, the differential evolution
algorithm starts with an initial population of candidate solutions. These candidate solutions
are iteratively improved by introducing mutations into the population, and retaining the fittest
candidate solutions that yield a lower objective function value. The differential evolution
algorithm is advantageous over the aforementioned popular approaches because it can handle
nonlinear and non-differentiable multi-dimensional objective functions, while requiring very few
control parameters to steer the minimisation. These characteristics make the algorithm easier
and more practical to use.
In this tutorial, you will discover the differential evolution algorithm for global optimization.
After completing this tutorial, you will know:
▷ Differential evolution is a heuristic approach for the global optimization of nonlinear
and non-differentiable continuous space functions.
▷ How to implement the differential evolution algorithm from scratch in Python.
▷ How to apply the differential evolution algorithm to a real-valued 2D objective function.
Let's get started.
"... population vectors to a third vector. Let this operation be called mutation. In order to increase the diversity of the perturbed parameter vectors, crossover is introduced."
-- "Differential evolution", 1997.
These mutations are generated according to a mutation strategy, which follows a general naming convention of DE/x/y/z, where DE stands for Differential Evolution, while x denotes the vector to be mutated, y denotes the number of difference vectors considered for the mutation of x, and z is the type of crossover in use. For instance, the popular strategies:
▷ DE/rand/1/bin
▷ DE/best/2/bin
Specify that vector x can either be picked randomly (rand) from the population, or else the
vector with the lowest cost (best) is selected; that the number of difference vectors under
consideration is either 1 or 2; and that crossover is performed according to independent binomial
(bin) experiments. The DE/best/2/bin strategy, in particular, appears to be highly beneficial in
improving the diversity of the population if the population size is large enough.
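To make the naming convention concrete, the short sketch below contrasts how the base vector is chosen under the two strategies above. It is illustrative only; the names pop, best_idx, and F are assumptions rather than part of a specific listing.
from numpy.random import choice

def de_rand_1(pop, j, F):
    # DE/rand/1: base vector picked at random, one difference vector
    candidates = [i for i in range(len(pop)) if i != j]
    a, b, c = pop[choice(candidates, 3, replace=False)]
    return a + F * (b - c)

def de_best_2(pop, j, best_idx, F):
    # DE/best/2: base vector is the best-so-far, two difference vectors
    candidates = [i for i in range(len(pop)) if i not in (j, best_idx)]
    b, c, d, e = pop[choice(candidates, 4, replace=False)]
    return pop[best_idx] + F * (b - c) + F * (d - e)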
"The usage of two difference vectors seems to improve the diversity of the population ... the scale factor, the crossover rate and the population size) - a feature that makes it easy to use for the practitioners."
-- "Recent advances in differential evolution - An updated survey", 2016.
There have been further variants to the canonical differential evolution algorithm described above, which one may read about in Recent advances in differential evolution - An updated survey¹, 2016.
Now that we are familiar with the differential evolution algorithm, let's look at how to implement it from scratch.
...
pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0]))
Program 19.1: Initialise population of candidate solutions randomly within the specified bounds
¹ https://link.springer.com/article/10.1007/s10462-009-9137-2
It is within these same bounds that the objective function will also be evaluated. An objective
function of choice and the bounds on each input variable may, therefore, be defined as follows:
def obj(x):
    return 0
We can evaluate our initial population of candidate solutions by passing it to the objective
function as input argument.
...
obj_all = [obj(ind) for ind in pop]
Program 19.3: Evaluate initial population of candidate solutions
We shall be replacing the values in obj_all with better ones as the population evolves and converges towards an optimal solution. We can then loop over a predefined number of iterations of the algorithm, such as 100 or 1,000, as specified by the parameter iter, as well as over all candidate solutions.
...
for i in range(iter):
    # iterate over all candidate solutions
    for j in range(pop_size):
        ...
Program 19.4: Run iterations of the algorithm
The first step of the algorithm iteration performs a mutation process. For this purpose, three
random candidates, a, b and c, that are not the current one, are randomly selected from the
population and a mutated vector is generated by computing: a + F × (b − c). Recall that
F ∈ [0, 2] and denotes the mutation scale factor.
...
candidates = [candidate for candidate in range(pop_size) if candidate != j]
a, b, c = pop[choice(candidates, 3, replace=False)]
Program 19.5: Choose three candidates that are not the current one
The mutation process is performed by the function, mutation, to which we pass [a, b, c] and
F as input arguments.
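A mutation() function of this kind can be as small as a single line; the sketch below assumes the arguments described above (a list holding the three candidate vectors and the scale factor F).
def mutation(x, F):
    # mutated vector: a + F * (b - c)
    return x[0] + F * (x[1] - x[2])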
Since we are operating within a bounded range of values, we need to check whether the newly
mutated vector is also within the specified bounds, and if not, clip its values to the upper or
lower limits as necessary. This check is carried out by the function, check_bounds.
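A possible check_bounds() helper, assuming bounds is an array of (min, max) pairs, simply clips each element of the mutated vector to its allowed range:
from numpy import clip

def check_bounds(mutated, bounds):
    # clip each value to the lower and upper limit of its dimension
    return clip(mutated, bounds[:, 0], bounds[:, 1])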
The next step performs crossover, where specific values of the current vector (target) are
replaced by the corresponding values in the mutated vector, to create a trial vector. The
decision of which values to replace is based on whether a uniform random value generated for
each input variable falls below a crossover rate. If it does, then the corresponding values from
the mutated vector are copied to the target vector. The crossover process is implemented by
the crossover() function, which takes the mutated and target vectors as input, as well as the
crossover rate, cr ∈ [0, 1], and the number of input variables.
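One way the crossover() function might be written, following that description (the uniform random draws and the comparison against cr are the only moving parts), is sketched here:
from numpy import asarray
from numpy.random import rand

def crossover(mutated, target, dims, cr):
    # generate a uniform random value for every dimension
    p = rand(dims)
    # copy from the mutated vector where the draw falls below the crossover rate
    trial = [mutated[i] if p[i] < cr else target[i] for i in range(dims)]
    return asarray(trial)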
A final selection step replaces the target vector with the trial vector if the latter yields a lower objective function value. For this purpose, we evaluate both vectors on the objective function and subsequently perform selection, storing the new objective function value in obj_all if the trial vector is found to be the fitter of the two.
...
# compute objective function value for target vector
obj_target = obj(pop[j])
# compute objective function value for trial vector
obj_trial = obj(trial)
# perform selection
if obj_trial < obj_target:
    # replace the target vector with the trial vector
    pop[j] = trial
    # store the new objective function value
    obj_all[j] = obj_trial
Program 19.9: Selection of the best vector
We can tie all steps together into a differential_evolution() function that takes as input arguments the population size, the bounds of each input variable, the total number of iterations, the mutation scale factor and the crossover rate, and returns the best solution found and its evaluation.
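As a rough guide, such a driver function might be assembled from the helpers sketched above along the following lines. This is a sketch rather than the chapter's exact listing; it assumes that obj(), mutation(), check_bounds(), and crossover() are defined as described in this section.
from numpy import argmin, around
from numpy.random import rand, choice

def differential_evolution(pop_size, bounds, iter, F, cr):
    # initialise population of candidate solutions randomly within the specified bounds
    pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0]))
    # evaluate initial population of candidate solutions
    obj_all = [obj(ind) for ind in pop]
    # find the best performing vector of the initial population
    best_vector = pop[argmin(obj_all)]
    best_obj = min(obj_all)
    prev_obj = best_obj
    # run iterations of the algorithm
    for i in range(iter):
        # iterate over all candidate solutions
        for j in range(pop_size):
            # choose three candidates that are not the current one
            candidates = [candidate for candidate in range(pop_size) if candidate != j]
            a, b, c = pop[choice(candidates, 3, replace=False)]
            # perform mutation and keep the result within bounds
            mutated = check_bounds(mutation([a, b, c], F), bounds)
            # perform crossover to create a trial vector
            trial = crossover(mutated, pop[j], len(bounds), cr)
            # perform selection
            obj_trial = obj(trial)
            if obj_trial < obj_all[j]:
                pop[j] = trial
                obj_all[j] = obj_trial
        # report the best performing vector at each iteration
        best_obj = min(obj_all)
        if best_obj < prev_obj:
            best_vector = pop[argmin(obj_all)]
            prev_obj = best_obj
            print('Iteration: %d f([%s]) = %.5f' % (i, around(best_vector, decimals=5), best_obj))
    return [best_vector, best_obj]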
Now that we have implemented the differential evolution algorithm, let's investigate how to use it to optimize an objective function.
19.4 Differential Evolution Algorithm on the Sphere Function
def obj(x):
    return x[0]**2.0 + x[1]**2.0
Program 19.11: Define objective function
We will minimise this objective function with the differential evolution algorithm, based on the strategy DE/rand/1/bin. In order to do so, we must define values for the algorithm parameters, specifically for the population size, the number of iterations, the mutation scale factor and the crossover rate. We set these values empirically to 10, 100, 0.5 and 0.7, respectively.
...
# define population size
pop_size = 10
# define number of iterations
iter = 100
# define scale factor for mutation
F = 0.5
# define crossover rate for recombination
cr = 0.7
Program 19.12: Define hyperparameters
...
bounds = asarray([(-5.0, 5.0), (-5.0, 5.0)])
Program 19.13: Define lower and upper bounds for every dimension
...
solution = differential_evolution(pop_size, bounds, iter, F, cr)
Program 19.14: Perform differential evolution
Running the example reports the progress of the search including the iteration number, and the
response from the objective function each time an improvement is detected. At the end of the
search, the best solution is found and its evaluation is reported.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the algorithm converges very close to f(0.0, 0.0) = 0.0 in about
33 improvements out of 100 iterations.
We can plot the objective function values returned at every improvement by modifying the
differential_evolution() function slightly to keep track of the objective function values and
return this in the list, obj_iter.
a, b, c = pop[choice(candidates, 3, replace=False)]
# perform mutation
mutated = mutation([a, b, c], F)
# check that lower and upper bounds are retained after mutation
mutated = check_bounds(mutated, bounds)
# perform crossover
trial = crossover(mutated, pop[j], len(bounds), cr)
# compute objective function value for target vector
obj_target = obj(pop[j])
# compute objective function value for trial vector
obj_trial = obj(trial)
# perform selection
if obj_trial < obj_target:
    # replace the target vector with the trial vector
    pop[j] = trial
    # store the new objective function value
    obj_all[j] = obj_trial
# find the best performing vector at each iteration
best_obj = min(obj_all)
# store the lowest objective function value
if best_obj < prev_obj:
    best_vector = pop[argmin(obj_all)]
    prev_obj = best_obj
    obj_iter.append(best_obj)
    # report progress at each iteration
    print('Iteration: %d f([%s]) = %.5f' % (i, around(best_vector, decimals=5), best_obj))
return [best_vector, best_obj, obj_iter]
Program 19.16: Differential evolution function with value of each iteration stored
We can then create a line plot of these objective function values to see the relative changes at
every improvement during the search.
Running the example creates a line plot. The line plot shows the objective function evaluation
for each improvement, with large changes initially and very small changes towards the end of
the search as the algorithm converged on the optima.
Figure 19.1: Line Plot of Objective Function Evaluation for Each Improvement During
the Differential Evolution Search
19.5 Further Reading
Papers
Rainer Storn and Kenneth Price. "Differential evolution - A simple and efficient heuristic for global optimization over continuous spaces". Journal of Global Optimization, 11(4), 1997, pp. 341-359.
https://link.springer.com/article/10.1023/A:1008202821328
Swagatam Das, Sankha Subhra Mullick, and P. N. Suganthan. "Recent advances in differential evolution - An updated survey". Swarm and Evolutionary Computation, 27, Apr. 2016, pp. 1-30.
https://www.sciencedirect.com/science/article/abs/pii/S2210650216000146
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Articles
Differential evolution. Wikipedia.
https://en.wikipedia.org/wiki/Differential_evolution
19.6 Summary
In this tutorial, you discovered the differential evolution algorithm. Specifically, you learned:
▷ Differential evolution is a heuristic approach for the global optimization of nonlinear
and non-differentiable continuous space functions.
▷ How to implement the differential evolution algorithm from scratch in Python.
▷ How to apply the differential evolution algorithm to a real-valued 2D objective function.
Next, you will learn about simulated annealing, an algorithm that is not in the family of evolutionary algorithms.
Simulated Annealing from Scratch
20
Simulated Annealing is a stochastic global search optimization algorithm. This means that it
makes use of randomness as part of the search process. This makes the algorithm appropriate
for nonlinear objective functions where other local search algorithms do not operate well. Like
the stochastic hill climbing local search algorithm, it modifies a single solution and searches the relatively local area of the search space until the local optima is located. Unlike the hill climbing algorithm, it may accept worse solutions as the current working solution. The likelihood of accepting worse solutions starts high at the beginning of the search and decreases with the progress of the search, giving the algorithm the opportunity to first locate the region of the global optima, escaping local optima, then hill climb to the optima itself.
In this tutorial, you will discover the simulated annealing optimization algorithm for function
optimization. After completing this tutorial, you will know:
▷ Simulated annealing is a stochastic global search algorithm for function optimization.
▷ How to implement the simulated annealing algorithm from scratch in Python.
▷ How to use the simulated annealing algorithm and inspect the results of the algorithm.
Let's get started.
"... random motion, tend to settle into better positions. A slow cooling brings the material to an ordered, crystalline state."
-- Page 128, Algorithms for Optimization, 2019.
The simulated annealing optimization algorithm can be thought of as a modified version of stochastic hill climbing. Stochastic hill climbing maintains a single candidate solution and
takes steps of a random but constrained size from the candidate in the search space. If the
new point is better than the current point, then the current point is replaced with the new
point. This process continues for a fixed number of iterations. Simulated annealing executes the
search in the same way. The main difference is that new points that are not as good as the
current point (worse points) are accepted sometimes. A worse point is accepted probabilistically
where the likelihood of accepting a solution worse than the current solution is a function of the
temperature of the search and how much worse the solution is than the current solution.
"The algorithm varies from Hill-Climbing in its decision of when to replace S, the ..."
¹ https://en.wikipedia.org/wiki/Simulated_annealing
² https://en.wikipedia.org/wiki/Annealing_(metallurgy)
It is this acceptance probability, known as the Metropolis criterion, that allows the ...
"... space, with the hope that in this phase the process will find a good region with the best local minimum. The temperature is then slowly brought down, reducing the stochasticity and forcing the search to converge to a minimum."
-- Page 128, Algorithms for Optimization, 2019.
Now that we are familiar with the simulated annealing algorithm, let's look at how to implement it from scratch.
def objective(x):
    return 0
Next, we can generate our initial point as a random point within the bounds of the problem,
then evaluate it using the objective function.
³ https://en.wikipedia.org/wiki/E_(mathematical_constant)
...
# generate an initial point
best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
# evaluate the initial point
best_eval = objective(best)
Program 20.2: Evaluate objective function at a random point
We need to maintain the "current" solution that is the focus of the search and that may be replaced with better solutions.
...
curr, curr_eval = best, best_eval
Program 20.3: Save the current working solution
Now we can loop over a predefined number of iterations of the algorithm defined as "n_iterations", such as 100 or 1,000.
...
for i in range(n_iterations):
...
Program 20.4: Run the algorithm
The first step of the algorithm iteration is to generate a new candidate solution from the current working solution, e.g. take a step. This requires a predefined "step_size" parameter, which is relative to the bounds of the search space. We will take a random step with a Gaussian distribution where the mean is our current point and the standard deviation is defined by the "step_size". That means that about 99 percent of the steps taken will be within 3 * step_size of the current point.
...
candidate = solution + randn(len(bounds)) * step_size
Program 20.5: Take a random step (normal distribution)
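As a quick sanity check on the claim that about 99 percent of Gaussian steps fall within 3 * step_size of the current point, one might run a small simulation like the following. This is purely illustrative and not part of the algorithm.
from numpy.random import randn

step_size = 0.1
# draw a large number of one-dimensional Gaussian steps
steps = randn(100000) * step_size
# fraction of steps that land within 3 * step_size of the current point
within = (abs(steps) <= 3 * step_size).mean()
# for a Gaussian, this fraction should be roughly 0.997
print('fraction within 3*step_size: %.4f' % within)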
We don't have to take steps in this way. You may wish to use a uniform distribution between 0 and the step size. For example:
...
candidate = solution + rand(len(bounds)) * step_size
Program 20.6: Take a random step (uniform distribution)
...
candidate_eval = objective(candidate)
Program 20.7: Evaluate candidate point
20.3 Implement Simulated Annealing 203
We then need to check if the evaluation of this new point is as good as or better than the current
best point, and if it is, replace our current best point with this new point. This is separate from
the current working solution that is the focus of the search.
...
if candidate_eval < best_eval:
    # store new best point
    best, best_eval = candidate, candidate_eval
    # report progress
    print('>%d f(%s) = %.5f' % (i, best, best_eval))
Program 20.8: Check for new best solution
Next, we need to prepare to replace the current working solution. The first step is to calculate the difference between the objective function evaluation of the candidate point and the current working solution.
...
diff = candidate_eval - curr_eval
Program 20.9: Difference between candidate and current point evaluation
Next, we need to calculate the current temperature, using the fast annealing schedule, where "temp" is the initial temperature provided as an argument.
...
t = temp / float(i + 1)
Program 20.10: Calculate temperature for current epoch
We can then calculate the likelihood of accepting a solution with worse performance than our
current working solution.
...
metropolis = exp(-diff / t)
Program 20.11: Calculate metropolis acceptance criterion
Finally, we can accept the new point as the current working solution if it has a better objective
function evaluation (the difference is negative) or if the objective function is worse, but we
probabilistically decide to accept it.
...
if diff < 0 or rand() < metropolis:
    # store the new current point
    curr, curr_eval = candidate, candidate_eval
Program 20.12: Check if we should keep the new point
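Put together, a minimal simulated_annealing() function might look like the sketch below. It follows the snippets in this section, although the exact listing used in the complete example may differ slightly.
from numpy import exp
from numpy.random import randn, rand

def simulated_annealing(objective, bounds, n_iterations, step_size, temp):
    # generate and evaluate an initial point
    best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    best_eval = objective(best)
    # current working solution
    curr, curr_eval = best, best_eval
    # run the algorithm
    for i in range(n_iterations):
        # take a Gaussian step from the current working solution
        candidate = curr + randn(len(bounds)) * step_size
        candidate_eval = objective(candidate)
        # check for new best solution
        if candidate_eval < best_eval:
            best, best_eval = candidate, candidate_eval
            print('>%d f(%s) = %.5f' % (i, best, best_eval))
        # difference between candidate and current point evaluation
        diff = candidate_eval - curr_eval
        # calculate temperature for current epoch
        t = temp / float(i + 1)
        # calculate metropolis acceptance criterion
        metropolis = exp(-diff / t)
        # check if we should keep the new point
        if diff < 0 or rand() < metropolis:
            curr, curr_eval = candidate, candidate_eval
    return [best, best_eval]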
Now that we know how to implement the simulated annealing algorithm in Python, let's look at how we might use it to optimize an objective function.
# objective function
def objective(x):
    return x[0]**2.0
Running the example creates a line plot of the objective function and clearly marks the function
optima.
Figure 20.1: Line Plot of Objective Function With Optima Marked With a Dashed Red
Line
Before we apply the optimization algorithm to the problem, let's take a moment to
understand the acceptance criterion a little better. First, the fast annealing schedule is an
exponential function of the number of iterations. We can make this clear by creating a plot of
the temperature for each algorithm iteration. We will use an initial temperature of 10 and 100
algorithm iterations, both arbitrarily chosen. The complete example is listed below.
# explore temperature vs algorithm iteration for simulated annealing
from matplotlib import pyplot
# total iterations of the algorithm
iterations = 100
# initial temperature
initial_temp = 10
# array of iterations from 0 to iterations - 1
iterations = [i for i in range(iterations)]
# temperatures for each iteration
temperatures = [initial_temp/float(i + 1) for i in iterations]
# plot iterations vs temperatures
pyplot.plot(iterations, temperatures)
pyplot.xlabel('Iteration')
pyplot.ylabel('Temperature')
pyplot.show()
Program 20.15: Explore temperature vs algorithm iteration for simulated annealing
Running the example calculates the temperature for each algorithm iteration and creates a plot
of algorithm iteration (x-axis) vs. temperature (y-axis). We can see that temperature drops
rapidly, exponentially, not linearly, such that after 20 iterations it is below 1 and stays low for
the remainder of the search.
Figure 20.2: Line Plot of Temperature vs. Algorithm Iteration for Fast Annealing
Next, we can get a better idea of how the metropolis acceptance criterion changes over time with the temperature. Recall that the criterion is a function of temperature, but it is also a function of how different the objective evaluation of the new point is compared to the current working solution. As such, we will plot the criterion for a few different "differences in objective function value" to see the effect it has on acceptance probability. The complete example is listed below.
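A sketch of what that example might look like is shown here; the exact listing may differ, and the differences in objective value of 0.01, 0.1, and 1.0 are assumed illustrative values.
from numpy import exp
from matplotlib import pyplot

# total number of algorithm iterations and initial temperature
iterations = 100
initial_temp = 10
# temperature for each iteration under the fast annealing schedule
temperatures = [initial_temp / float(i + 1) for i in range(iterations)]
# metropolis acceptance criterion for a few differences in objective value
differences = [0.01, 0.1, 1.0]
for d in differences:
    metropolis = [exp(-d / t) for t in temperatures]
    # plot iterations vs metropolis acceptance probability for this difference
    pyplot.plot(range(iterations), metropolis, label='diff=%.2f' % d)
pyplot.xlabel('Iteration')
pyplot.ylabel('Metropolis Criterion')
pyplot.legend()
pyplot.show()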
Running the example calculates the metropolis acceptance criterion for each algorithm iteration
using the temperature shown for each iteration (shown in the previous section). The plot has
three lines for three differences between the new worse solution and the current working solution.
We can see that the worse the solution is (the larger the difference), the less likely the model is
to accept the worse solution regardless of the algorithm iteration, as we might expect. We can
also see that in all cases, the likelihood of accepting worse solutions decreases with algorithm
iteration.
Figure 20.3: Line Plot of Metropolis Acceptance Criterion vs. Algorithm Iteration for
Simulated Annealing
Now that we are more familiar with the behavior of the temperature and metropolis acceptance criterion over time, let's apply simulated annealing to our test problem. First, we will seed the pseudorandom number generator. This is not required in general, but in this case, I want to ensure we get the same results (same sequence of random numbers) each time we run the algorithm so we can plot the results later.
...
seed(1)
Program 20.17: Seed the pseudorandom number generator
Next, we can define the configuration of the search. In this case, we will search for 1,000 iterations of the algorithm and use a step size of 0.1. Given that we are using a Gaussian function for generating the step, this means that about 99 percent of all steps taken will be within a distance of (0.1 × 3) of a given point, i.e. three standard deviations. We will also use an initial temperature of 10.0. The search procedure is more sensitive to the annealing schedule than the initial temperature; as such, initial temperature values are almost arbitrary.
...
n_iterations = 1000
# define the maximum step size
step_size = 0.1
# initial temperature
temp = 10
Program 20.18: Set up hyperparameters
...
best, score = simulated_annealing(objective, bounds, n_iterations, step_size, temp)
print('Done!')
print('f(%s) = %f' % (best, score))
Program 20.19: Perform the simulated annealing search
# objective function
def objective(x):
    return x[0]**2.0
Running the example reports the progress of the search including the iteration number, the
input to the function, and the response from the objective function each time an improvement
was detected. At the end of the search, the best solution is found and its evaluation is reported.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see about 20 improvements over the 1,000 iterations of the algorithm
and a solution that is very close to the optimal input of 0.0 that evaluates to f(0.0) = 0.0.
It can be interesting to review the progress of the search as a line plot that shows the
change in the evaluation of the best solution each time there is an improvement. We can update
the simulated_annealing() to keep track of the objective function evaluations each time there
is an improvement and return this list of scores.
We can then create a line plot of these scores to see the relative change in objective function for
each improvement found during the search.
...
pyplot.plot(scores, '.-')
pyplot.xlabel('Improvement Number')
pyplot.ylabel('Evaluation f(x)')
pyplot.show()
Program 20.22: Line plot of best scores
Tying this together, the complete example of performing the search and plotting the objective
function scores of the improved solutions during the search is listed below.
# objective function
def objective(x):
    return x[0]**2.0
Running the example performs the search and reports the results as before. A line plot is created showing the objective function evaluation for each improvement during the simulated annealing search. We can see about 20 changes to the objective function evaluation during the search, with large changes initially and very small to imperceptible changes towards the end of the search as the algorithm converged on the optima.
Figure 20.4: Line Plot of Objective Function Evaluation for Each Improvement During
the Simulated Annealing Search
20.5 Further Reading
Papers
S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. "Optimization by Simulated Annealing". Science, 220(4598), 1983, pp. 671-680.
https://science.sciencemag.org/content/220/4598/671.abstract
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Sean Luke. Essentials of Metaheuristics. lulu.com, 2011.
https://amzn.to/3lHryZr
Articles
Simulated annealing. Wikipedia.
https://en.wikipedia.org/wiki/Simulated_annealing
Annealing (metallurgy). Wikipedia.
https://en.wikipedia.org/wiki/Annealing_(metallurgy)
20.6 Summary
In this tutorial, you discovered the simulated annealing optimization algorithm for function
optimization. Specifically, you learned:
▷ Simulated annealing is a stochastic global search algorithm for function optimization.
▷ How to implement the simulated annealing algorithm from scratch in Python.
▷ How to use the simulated annealing algorithm and inspect the results of the algorithm.
Next, we start to learn about various gradient descent algorithms.
V
Gradient Descent
Gradient Descent Optimization from Scratch
21
Gradient descent is an optimization algorithm that follows the negative gradient of an objective
function in order to locate the minimum of the function. It is a simple and effective technique
that can be implemented with just a few lines of code. It also provides the basis for many
extensions and modiĄcations that can result in better performance. The algorithm also provides
the basis for the widely used extension called stochastic gradient descent, used to train deep
learning neural networks.
In this tutorial, you will discover how to implement gradient descent optimization from
scratch. After completing this tutorial, you will know:
▷ Gradient descent is a general procedure for optimizing a differentiable objective function.
▷ How to implement the gradient descent algorithm from scratch in Python.
▷ How to apply the gradient descent algorithm to an objective function.
Let's get started.
¹ https://en.wikipedia.org/wiki/Gradient_descent
"First-order methods rely on gradient information to help direct the search for a minimum ..."
-- Page 69, Algorithms for Optimization, 2019.
The first-order derivative, or simply the "derivative²", is the rate of change or slope of the target function at a specific point, e.g. for a specific input. If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the "gradient³".
▷ Gradient: First-order derivative for a multivariate objective function.
The derivative or the gradient points in the direction of the steepest ascent of the target function for an input.
"The gradient points in the direction of steepest ascent of the tangent hyperplane ..."
² https://en.wikipedia.org/wiki/Derivative
³ https://en.wikipedia.org/wiki/Gradient
⁴ https://en.wikipedia.org/wiki/Differentiable_function
Gradient descent is also the basis for the optimization algorithm used to train deep learning neural networks, referred to as stochastic gradient descent, or SGD. In this variation, the target function is an error function and the function gradient is approximated from prediction error on samples from the problem domain.
Now that we are familiar with a high-level idea of gradient descent optimization, let's look at how we might implement the algorithm.
21.3 Gradient Descent Algorithm
A step is taken in the search space by subtracting the gradient, scaled by the step size (α), from the current point:
x_new = x − α × f'(x)
The steeper the objective function at a given point, the larger the magnitude of the gradient,
and in turn, the larger the step taken in the search space. The size of the step taken is scaled
using a step size hyperparameter.
▷ Step Size (α): Hyperparameter that controls how far to move in the search space against
the gradient each iteration of the algorithm.
If the step size is too small, the movement in the search space will be small and the search will
take a long time. If the step size is too large, the search may bounce around the search space
and skip over the optima.
"We have the option of either taking very small steps and re-evaluating the gradient at every step, or we can take large steps each time. The first approach results in a laborious method of reaching the minimizer, whereas the second approach may result in a more zigzag path to the minimizer."
-- Page 114, An Introduction to Optimization, 2001.
Finding a good step size may take some trial and error for the specific target function. The difficulty of choosing the step size can make finding the exact optima of the target function hard. Many extensions involve adapting the learning rate over time to take smaller steps or different sized steps in different dimensions and so on to allow the algorithm to hone in on the function optima. The process of calculating the derivative of a point and calculating a new point in the input space is repeated until some stop condition is met. This might be a fixed number of steps or target function evaluations, a lack of improvement in target function evaluation over some number of iterations, or the identification of a flat (stationary) area of the search space signified by a gradient of zero.
▷ Stop Condition: Decision when to end the search procedure.
Let's look at how we might implement the gradient descent algorithm in Python. First, we can define an initial point as a randomly selected point in the input space defined by the bounds. The bounds can be defined along with an objective function as an array with a min and max value for each dimension. The rand()⁵ NumPy function can be used to generate a vector of random numbers in the range 0 to 1.
...
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
Program 21.1: Generate an initial point
We can then calculate the derivative of the point using a function named derivative().
...
gradient = derivative(solution)
Program 21.2: Calculate gradient
And take a step in the search space to a new point down the hill of the current point. The new
position is calculated using the calculated gradient and the step_size hyperparameter.
...
solution = solution - step_size * gradient
Program 21.3: Take a step
...
solution_eval = objective(solution)
Program 21.4: Evaluate candidate point
This process can be repeated for a fixed number of iterations controlled via an n_iter hyperparameter.
...
for i in range(n_iter):
    # calculate gradient
    gradient = derivative(solution)
    # take a step
    solution = solution - step_size * gradient
    # evaluate candidate point
    solution_eval = objective(solution)
    # report progress
    print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
Program 21.5: Run the gradient descent
⁵ https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
We can tie all of this together into a function named gradient_descent(). The function takes
the name of the objective and gradient functions, as well as the bounds on the inputs to the
objective function, number of iterations and step size, then returns the solution and its evaluation
at the end of the search. The complete gradient descent optimization algorithm implemented as
a function is listed below.
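A sketch of what that function might look like, assembled from the snippets above, is shown here; the exact listing may differ.
from numpy.random import rand

def gradient_descent(objective, derivative, bounds, n_iter, step_size):
    # generate an initial point within the bounds
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # run the gradient descent updates
    for i in range(n_iter):
        # calculate gradient
        gradient = derivative(solution)
        # take a step
        solution = solution - step_size * gradient
        # evaluate candidate point
        solution_eval = objective(solution)
        # report progress
        print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]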
Now that we are familiar with the gradient descent algorithm, let's look at a worked example.
def objective(x):
    return x**2.0
Program 21.7: Objective function
We can then sample all inputs in the range and calculate the objective function value for each.
...
# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max+0.1, 0.1)
# compute targets
results = objective(inputs)
Program 21.8: Compute the objective function for all inputs in range
Finally, we can create a line plot of the inputs (x-axis) versus the objective function values
(y-axis) to get an intuition for the shape of the objective function that we will be searching.
...
# create a line plot of input vs result
pyplot.plot(inputs, results)
# show the plot
pyplot.show()
Program 21.9: Plot the objective function input and result
The example below ties this together and provides an example of plotting the one-dimensional
test function.
# objective function
def objective(x):
    return x**2.0
Running the example creates a line plot of the inputs to the function (x-axis) and the calculated output of the function (y-axis). We can see the familiar U-shape called a parabola.
Next, we can apply the gradient descent algorithm to the problem. First, we need a function that calculates the derivative for this function. The derivative of x² is 2x, and the derivative() function below implements this.
def derivative(x):
    return x * 2.0
Program 21.11: Derivative of objective function
We can then define the bounds of the objective function, the step size, and the number of iterations for the algorithm. We will use a step size of 0.1 and 30 iterations, both found after a little experimentation.
...
# define range for input
bounds = asarray([[-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the maximum step size
step_size = 0.1
# perform the gradient descent search
best, score = gradient_descent(objective, derivative, bounds, n_iter, step_size)
Program 21.12: Perform gradient descent search
Tying this together, the complete example of applying gradient descent optimization to our
one-dimensional test function is listed below.
# objective function
def objective(x):
    return x**2.0
Running the example starts with a random point in the search space then applies the gradient
descent algorithm, reporting performance along the way.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the algorithm finds a good solution after about 20-30 iterations with a function evaluation of about 0.0. Note the optima for this function is at f(0.0) = 0.0. Now, let's get a feeling for the importance of a good step size. Set the step size to a large value, such as 1.0, and re-run the search.
...
step_size = 1.0
Program 21.14: Define a larger step size
Run the example with the larger step size and inspect the results.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
We can see that the search does not find the optima, and instead bounces around the domain, in this case between the values 0.64820935 and −0.64820935.
...
>25 f([0.64820935]) = 0.42018
>26 f([-0.64820935]) = 0.42018
>27 f([0.64820935]) = 0.42018
>28 f([-0.64820935]) = 0.42018
>29 f([0.64820935]) = 0.42018
Done!
f([0.64820935]) = 0.420175
Output 21.2: Result from Program 21.13 with a larger step size
...
step_size = 1e-5
Program 21.15: Define a smaller step size
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
Re-running the search, we can see that the algorithm moves very slowly down the slope of
the objective function from the starting point.
...
>25 f([-0.87315153]) = 0.76239
>26 f([-0.87313407]) = 0.76236
>27 f([-0.8731166]) = 0.76233
>28 f([-0.87309914]) = 0.76230
>29 f([-0.87308168]) = 0.76227
Done!
f([-0.87308168]) = 0.762272
Output 21.3: Result from Program 21.13 with a smaller step size
These two quick examples highlight the problems in selecting a step size that is too large
or too small and the general importance of testing many different step size values for a given
objective function. Finally, we can change the learning rate back to 0.1 and visualize the progress
of the search on a plot of the target function. First, we can update the gradient_descent()
function to store all solutions and their score found during the optimization as lists and return
them at the end of the search instead of the best solution found.
# store solution
solutions.append(solution)
scores.append(solution_eval)
# report progress
print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
return [solutions, scores]
Program 21.16: Gradient descent algorithm with the scores stored
The function can be called, and we can get the lists of the solutions and their scores found
during the search.
...
solutions, scores = gradient_descent(objective, derivative, bounds, n_iter, step_size)
Program 21.17: Perform the gradient descent search and retrieve the scores
...
# sample input range uniformly at 0.1 increments
inputs = arange(bounds[0,0], bounds[0,1]+0.1, 0.1)
# compute targets
results = objective(inputs)
# create a line plot of input vs result
pyplot.plot(inputs, results)
Program 21.18: Plot the objective function
Finally, we can plot each solution found as a red dot and connect the dots with a line so we can
see how the search moved downhill.
...
pyplot.plot(solutions, scores, '.-', color='red')
Program 21.19: Plot the solutions found on the objective function
Tying this all together, the complete example of plotting the result of the gradient descent
search on the one-dimensional test function is listed below.
# objective function
def objective(x):
    return x**2.0
Running the example performs the gradient descent search on the objective function as before,
except in this case, each point found during the search is plotted.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the search started about halfway up the left part of the function and stepped downhill to the bottom of the basin. We can see that in the parts of the objective function with the larger curve, the derivative (gradient) is larger, and in turn, larger steps are taken. Similarly, the gradient is smaller as we get closer to the optima, and in turn, smaller steps are taken. This highlights that the step size is used as a scale factor on the magnitude of the gradient (curvature) of the objective function.
Figure 21.2: Plot of the Progress of Gradient Descent on a One Dimensional Objective
Function
21.5 Further Reading
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Edwin K. P. Chong and Stanislaw H. Zak. An Introduction to Optimization. Wiley-Blackwell,
2001.
https://amzn.to/37S9WVs
APIs
numpy.random.rand API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.
html
numpy.asarray API.
https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
Matplotlib API.
https://matplotlib.org/api/pyplot_api.html
Articles
Gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Gradient_descent
Gradient. Wikipedia.
https://en.wikipedia.org/wiki/Gradient
Derivative. Wikipedia.
https://en.wikipedia.org/wiki/Derivative
Differentiable function. Wikipedia.
https://en.wikipedia.org/wiki/Differentiable_function
21.6 Summary
In this tutorial, you discovered how to implement gradient descent optimization from scratch.
Specifically, you learned:
▷ Gradient descent is a general procedure for optimizing a differentiable objective function.
▷ How to implement the gradient descent algorithm from scratch in Python.
▷ How to apply the gradient descent algorithm to an objective function.
Next, we will learn about a strategy that can improve gradient descent.
Gradient Descent with Momentum
22
Gradient descent is an optimization algorithm that follows the negative gradient of an objective
function in order to locate the minimum of the function. A problem with gradient descent is
that it can bounce around the search space on optimization problems that have large amounts
of curvature or noisy gradients, and it can get stuck in flat spots in the search space that have no gradient. Momentum is an extension to the gradient descent optimization algorithm that allows the search to build inertia in a direction in the search space and overcome the oscillations of noisy gradients and coast across flat spots of the search space.
In this tutorial, you will discover the gradient descent with momentum algorithm. After
completing this tutorial, you will know:
▷ Gradient descent is an optimization algorithm that uses the gradient of the objective
function to navigate the search space.
▷ Gradient descent can be accelerated by using momentum from past updates to the
search position.
▷ How to implement gradient descent optimization with momentum and develop an
intuition for its behavior.
Let's get started.
"First-order methods rely on gradient information to help direct the search for a minimum ..."
-- Page 69, Algorithms for Optimization, 2019.
The first-order derivative², or simply the "derivative", is the rate of change or slope of the target function at a specific point, e.g. for a specific input. If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the "gradient³".
▷ Gradient: First-order derivative for a multivariate objective function.
The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input. Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function. The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs. The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space. The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function. A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called α or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.
x_new = x − α × f'(x)
The steeper the objective function at a given point, the larger the magnitude of the gradient
and, in turn, the larger the step taken in the search space. The size of the step taken is scaled
using a step size hyperparameter.
▷ Step Size (α): Hyperparameter that controls how far to move in the search space against
the gradient each iteration of the algorithm, also called the learning rate.
If the step size is too small, the movement in the search space will be small and the search will
take a long time. If the step size is too large, the search may bounce around the search space
and skip over the optima.
Now that we are familiar with the gradient descent optimization algorithm, let's take a look at momentum.
¹ https://en.wikipedia.org/wiki/Gradient_descent
² https://en.wikipedia.org/wiki/Derivative
³ https://en.wikipedia.org/wiki/Gradient
22.3 Momentum
Momentum is an extension to the gradient descent optimization algorithm, often referred to
as gradient descent with momentum. It is designed to accelerate the optimization process,
e.g. decrease the number of function evaluations required to reach the optima, or to improve
the capability of the optimization algorithm, e.g. result in a better final result. A problem with
the gradient descent algorithm is that the progression of the search can bounce around the
search space based on the gradient. For example, the search may progress downhill towards the
minima, but during this progression, it may move in another direction, even uphill, depending
on the gradient of specific points (sets of parameters) encountered during the search.
This can slow down the progress of the search, especially for those optimization problems
where the broader trend or shape of the search space is more useful than specific gradients along
the way. One approach to this problem is to add history to the parameter update equation based
on the gradient encountered in the previous updates. This change is based on the metaphor
of momentum from physics where acceleration in a direction can be accumulated from past
updates.
"The name momentum derives from a physical analogy, in which the negative gradient ..."
Written in terms of the change in position, Δx, the basic gradient descent update is:
x_new = x − Δx
Momentum involves maintaining the change in the position and using it in the subsequent
calculation of the change in position. If we think of updates over time, then the update at the
current iteration or time (t) will add the change used at the previous time (t − 1) weighted by the momentum hyperparameter η, as follows:
Δx(t) = α × f'(x(t−1)) + η × Δx(t−1)
x(t) = x(t−1) − Δx(t)
The change in the position accumulates magnitude and direction of changes over the iterations of the search, proportional to the size of the momentum hyperparameter. For example, a large momentum (e.g. 0.9) will mean that the update is strongly influenced by the previous update, whereas a modest momentum (0.2) will mean very little influence.
"The momentum algorithm accumulates an exponentially decaying moving average ..."
"... it damps the size of the steps along directions of high curvature thus yielding a larger effective learning rate along the directions of low curvature."
-- Page 21, Neural Networks: Tricks of the Trade, 2012.
Momentum is most useful in optimization problems where the objective function has a large amount of curvature (i.e. changes a lot), meaning that the gradient may change a lot over relatively small regions of the search space.
"The method of momentum is designed to accelerate learning, especially in the face ..."
def objective(x):
    return x**2.0
Program 22.1: Objective function
We can then sample all inputs in the range and calculate the objective function value for each.
...
# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max+0.1, 0.1)
# compute targets
results = objective(inputs)
Program 22.2: Compute the objective function for all inputs in range
Finally, we can create a line plot of the inputs (x-axis) versus the objective function values
(y-axis) to get an intuition for the shape of the objective function that we will be searching.
...
# create a line plot of input vs result
pyplot.plot(inputs, results)
# show the plot
pyplot.show()
Program 22.3: Plot the objective function input and result
The example below ties this together and provides an example of plotting the one-dimensional
test function.
# objective function
def objective(x):
    return x**2.0
Running the example creates a line plot of the inputs to the function (x-axis) and the calculated output of the function (y-axis). We can see the familiar U-shape called a parabola.
22.5 Gradient Descent Optimization
def derivative(x):
    return x * 2.0
Program 22.5: Derivative of objective function
We can define a function that implements the gradient descent optimization algorithm. The procedure involves starting with a randomly selected point in the search space, then calculating the gradient, updating the position in the search space, evaluating the new position, and reporting the progress. This process is then repeated for a fixed number of iterations. The final point and its evaluation are then returned from the function.
The function gradient_descent() below implements this and takes the name of the
objective and gradient functions as well as the bounds on the inputs to the objective function,
number of iterations, and step size, then returns the solution and its evaluation at the end of
the search.
# take a step
solution = solution - step_size * gradient
# evaluate candidate point
solution_eval = objective(solution)
# report progress
print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
return [solution, solution_eval]
Program 22.6: Gradient descent algorithm
We can then define the bounds of the objective function, the step size, and the number of iterations for the algorithm. We will use a step size of 0.1 and 30 iterations, both found after a little experimentation. The seed for the pseudorandom number generator is fixed so that we always get the same sequence of random numbers, and in this case, it ensures that we get the same starting point for the search each time the code is run (e.g. something interesting far from the optima).
...
# seed the pseudo random number generator
seed(4)
# define range for input
bounds = asarray([[-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the maximum step size
step_size = 0.1
# perform the gradient descent search
best, score = gradient_descent(objective, derivative, bounds, n_iter, step_size)
Program 22.7: Perform gradient descent search
Tying this together, the complete example of applying gradient descent to our one-dimensional test function is listed below.
# objective function
def objective(x):
    return x**2.0
gradient = derivative(solution)
# take a step
solution = solution - step_size * gradient
# evaluate candidate point
solution_eval = objective(solution)
# report progress
print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
return [solution, solution_eval]
Running the example starts with a random point in the search space, then applies the gradient
descent algorithm, reporting performance along the way.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the algorithm finds a good solution after about 27 iterations, with a function evaluation of about 0.0. Note the optima for this function is at f(0.0) = 0.0. We would expect that gradient descent with momentum will accelerate the optimization procedure and find a similarly evaluated solution in fewer iterations.
The function can be called and we can get the lists of the solutions and the scores found during
the search.
...
solutions, scores = gradient_descent(objective, derivative, bounds, n_iter, step_size)
Program 22.10: Perform the gradient descent search and retrieve the scores
...
# sample input range uniformly at 0.1 increments
inputs = arange(bounds[0,0], bounds[0,1]+0.1, 0.1)
# compute targets
results = objective(inputs)
# create a line plot of input vs result
pyplot.plot(inputs, results)
Program 22.11: Plot the objective function
Finally, we can plot each solution found as a red dot and connect the dots with a line so we can
see how the search moved downhill.
...
pyplot.plot(solutions, scores, '.-', color='red')
Program 22.12: Plot the solutions found on the objective function
Tying this all together, the complete example of plotting the result of the gradient descent
search on the one-dimensional test function is listed below.
# objective function
def objective(x):
    return x**2.0
Running the example performs the gradient descent search on the objective function as before,
except in this case, each point found during the search is plotted.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the search started more than halfway up the right part of
the function and stepped downhill to the bottom of the basin. We can see that in the parts
of the objective function with the larger curve, the derivative (gradient) is larger, and in turn,
larger steps are taken. Similarly, the gradient is smaller as we get closer to the optima, and in
turn, smaller steps are taken. This highlights that the step size is used as a scale factor on the
magnitude of the gradient (curvature) of the objective function.
Figure 22.2: Plot of the Progress of Gradient Descent on a One Dimensional Objective
Function
...
change = 0.0
Program 22.14: Keep track of the change
We can then break the update procedure down into first calculating the gradient, then calculating the change to the solution, then calculating the position of the new solution, and finally saving the change for the next iteration.
...
# calculate gradient
gradient = derivative(solution)
# calculate update
new_change = step_size * gradient + momentum * change
# take a step
solution = solution - new_change
# save the change
change = new_change
Program 22.15: Update procedure with the change saved for the next iteration
The updated version of the gradient_descent() function with these changes is listed below.
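The full listing is not reproduced here; a minimal sketch of what the updated function might look like, assembled from the fragments above (this is our reconstruction rather than the book's exact listing, and it assumes rand() is imported from numpy.random), is:

# gradient descent algorithm with momentum (sketch)
from numpy.random import rand

def gradient_descent(objective, derivative, bounds, n_iter, step_size, momentum):
    # generate a random initial point within the bounds
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # keep track of the change from the previous iteration
    change = 0.0
    # run the gradient descent
    for i in range(n_iter):
        # calculate gradient
        gradient = derivative(solution)
        # calculate update: the gradient step plus a fraction of the last change
        new_change = step_size * gradient + momentum * change
        # take a step
        solution = solution - new_change
        # save the change for the next iteration
        change = new_change
        # evaluate candidate point
        solution_eval = objective(solution)
        # report progress
        print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]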
We can then choose a momentum value and pass it to the gradient_descent() function. After a little trial and error, a momentum value of 0.3 was found to be effective on this problem, given the fixed step size of 0.1.
...
# define momentum
momentum = 0.3
# perform the gradient descent search with momentum
best, score = gradient_descent(objective, derivative, bounds, n_iter, step_size, momentum)
Program 22.17: Gradient descent search with momentum
Tying this together, the complete example of gradient descent optimization with momentum is
listed below.
# objective function
def objective(x):
    return x**2.0
Running the example starts with a random point in the search space, then applies the gradient
descent algorithm with momentum, reporting performance along the way.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that the algorithm finds a good solution after about 13 iterations, with a function evaluation of about 0.0. As expected, this is faster (fewer iterations) than gradient descent without momentum, which took 27 iterations from the same starting point with the same step size.
# objective function
def objective(x):
    return x**2.0
Running the example performs the gradient descent search with momentum on the objective
function as before, except in this case, each point found during the search is plotted.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, if we compare the plot to the plot created previously for the performance of gradient descent (without momentum), we can see that the search indeed reaches the optima in fewer steps, noted with fewer distinct red dots on the path to the bottom of the basin.
Figure 22.3: Plot of the Progress of Gradient Descent With Momentum on a One
Dimensional Objective Function
As an extension, try different values for momentum, such as 0.8, and review the resulting
plot.
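For example, a heavier momentum value only requires changing one hyperparameter before rerunning the search. A small sketch, reusing the configuration and the gradient_descent() function from the example above:

# rerun the search with a heavier momentum value
momentum = 0.8
best, score = gradient_descent(objective, derivative, bounds, n_iter, step_size, momentum)
print('Done!')
print('f(%s) = %f' % (best, score))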
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller. Neural Networks: Tricks of the
Trade. 2nd ed. Springer, 2012.
https://amzn.to/3ac5S4Q
Russell Reed. Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks.
Bradford Books, 1999.
https://amzn.to/380Yjvd
Christopher M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
https://amzn.to/3nFrjyF
APIs
numpy.random.rand API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
numpy.asarray API.
https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
Matplotlib API.
https://matplotlib.org/api/pyplot_api.html
Articles
Gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Gradient_descent
Stochastic gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
Gradient. Wikipedia.
https://en.wikipedia.org/wiki/Gradient
Derivative. Wikipedia.
https://en.wikipedia.org/wiki/Derivative
Differentiable function. Wikipedia.
https://en.wikipedia.org/wiki/Differentiable_function
22.10 Summary
In this tutorial, you discovered the gradient descent with momentum algorithm. Specifically, you learned:
▷ Gradient descent is an optimization algorithm that uses the gradient of the objective
function to navigate the search space.
▷ Gradient descent can be accelerated by using momentum from past updates to the
search position.
▷ How to implement gradient descent optimization with momentum and develop an
intuition for its behavior.
Next, we will learn about a variation of the gradient descent algorithm that can automatically adjust the step size.
Gradient Descent with AdaGrad
23
Gradient descent is an optimization algorithm that follows the negative gradient of an objective
function in order to locate the minimum of the function. A limitation of gradient descent is
that it uses the same step size (learning rate) for each input variable. This can be a problem
on objective functions that have different amounts of curvature in different dimensions, and in
turn, may require a different sized step to a new point. Adaptive Gradients, or AdaGrad for short, is an extension of the gradient descent optimization algorithm that allows the step size in each dimension used by the optimization algorithm to be automatically adapted based on the gradients (partial derivatives) seen for that variable over the course of the search.
In this tutorial, you will discover how to develop the gradient descent with adaptive gradients
optimization algorithm from scratch. After completing this tutorial, you will know:
▷ Gradient descent is an optimization algorithm that uses the gradient of the objective
function to navigate the search space.
▷ Gradient descent can be updated to use an automatically adaptive step size for each
input variable in the objective function, called adaptive gradients or AdaGrad.
▷ How to implement the AdaGrad optimization algorithm from scratch and apply it to
an objective function and evaluate the results.
Let's get started.
“... minimum ...” (Page 69, Algorithms for Optimization, 2019)
The first-order derivative, or simply the "derivative," is the rate of change or slope of the target function at a specific point, e.g. for a specific input. If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the "gradient."
▷ Gradient: First-order derivative for a multivariate objective function.
The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input. Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.
The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f′() gives the derivative of the target function for a given set of inputs. The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space. The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function. A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called α or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

x_new = x − α × f′(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.
▷ Step Size (α): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.
If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
Now that we are familiar with the gradient descent optimization algorithm, let's take a look at AdaGrad.
“... rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.” (Page 307, Deep Learning, 2016)
A problem with the gradient descent algorithm is that the step size (learning rate) is the
same for each variable or dimension in the search space. It is possible that better performance
can be achieved using a step size that is tailored to each variable, allowing larger movements in
dimensions with a consistently steep gradient and smaller movements in dimensions with less
steep gradients. AdaGrad is designed to specifically explore the idea of automatically tailoring the step size for each dimension in the search space.
“The adaptive subgradient method, or Adagrad, adapts a learning rate for each component of x.” (Page 77, Algorithms for Optimization, 2019)
This is achieved by first calculating a step size for a given dimension, then using the calculated step size to make a movement in that dimension using the partial derivative. This process is then repeated for each dimension in the search space.
“Adagrad dulls the influence of parameters with consistently high gradients, thereby ...”
This process is then repeated for each input variable until a new point in the search space is created and can be evaluated. Importantly, the partial derivative for the current solution (iteration of the search) is included in the sum of the squared partial derivatives. We could maintain an array of partial derivatives or squared partial derivatives for each input variable, but this is not necessary. Instead, we simply maintain the sum of the squared partial derivatives and add new values to this sum along the way. Now that we are familiar with the AdaGrad algorithm, let's explore how we might implement it and evaluate its performance.
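Before moving to the implementation, the per-variable AdaGrad update described above can be summarized as equations (our restatement in the notation used elsewhere in the book, with η as the initial step size and 10⁻⁸ added to avoid a divide-by-zero error):

s(t + 1) = s(t) + f′(x(t))²
α(t + 1) = η / (10⁻⁸ + √(s(t + 1)))
x(t + 1) = x(t) − α(t + 1) × f′(x(t))

Where s(t) is the running sum of the squared partial derivatives for that variable.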
We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the
response surface. The complete example of plotting the objective function is listed below.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
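The listing above is abbreviated; a minimal sketch of the full surface-plot example might look like the following (the mesh spacing and the jet color scheme are assumptions carried over from the contour-plot listings later in the chapter):

# 3d surface plot of the test objective function (sketch)
from numpy import arange, meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define the range for the input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()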
Running the example creates a three-dimensional surface plot of the objective function. We can
see the familiar bowl shape with the global minima at f (0, 0) = 0.
We can also create a two-dimensional plot of the function. This will be helpful later when
we want to plot the progress of the search. The example below creates a contour plot of the
objective function.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
Running the example creates a two-dimensional contour plot of the objective function. We can
see the bowl shape compressed to contours shown with a color gradient. We will use this plot
to plot the specific points explored during the progress of the search.
Now that we have a test objective function, let's look at how we might implement the AdaGrad optimization algorithm.
f(x) = x²
f′(x) = 2 × x
The derivative of x² is 2x in each dimension. The derivative() function implements this below.
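For completeness, a minimal implementation consistent with the code that follows (which indexes the gradient one variable at a time) might be:

# derivative of the objective function
from numpy import asarray

def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])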
Next, we can implement gradient descent with adaptive gradients. First, we can select a random point in the bounds of the problem as a starting point for the search. This assumes we have an array that defines the bounds of the search with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of that dimension.
...
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
Program 23.5: Generate an initial point
Next, we need to initialize the sum of the squared partial derivatives for each dimension to 0.0
values.
...
sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
Program 23.6: List of the sum square gradients for each variable
We can then enumerate a fixed number of iterations of the search optimization algorithm defined by an "n_iter" hyperparameter.
...
for it in range(n_iter):
...
Program 23.7: Run the gradient descent
The first step is to calculate the gradient for the current solution using the derivative() function.
...
gradient = derivative(solution[0], solution[1])
Program 23.8: Calculate gradient
We then need to calculate the square of the partial derivative of each variable and add them to
the running sum of these values.
...
for i in range(gradient.shape[0]):
    sq_grad_sums[i] += gradient[i]**2.0
Program 23.9: Update the sum of the squared partial derivatives
We can then use the sum of squared partial derivatives and the gradient to calculate the next point. We will do this one variable at a time, first calculating the step size for the variable, then the new value for the variable. These values are built up in an array until we have a completely new solution that is in the steepest descent direction from the current point using the custom step sizes.
...
new_solution = list()
for i in range(solution.shape[0]):
    # calculate the step size for this variable
    alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
    # calculate the new position in this variable
    value = solution[i] - alpha * gradient[i]
    # store this variable
    new_solution.append(value)
Program 23.10: Build a solution one variable at a time
This new solution can then be evaluated using the objective() function and the performance
of the search can be reported.
...
# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
Program 23.11: Evaluate candidate point and report progress
# gradient descent algorithm with adagrad
# (requires: from math import sqrt; from numpy import asarray; from numpy.random import rand)
def adagrad(objective, derivative, bounds, n_iter, step_size):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the sum square gradients for each variable
    sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the sum of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sq_grad_sums[i] += gradient[i]**2.0
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]
Program 23.12: Gradient descent algorithm with AdaGrad
Note: We are using simple Python lists and an imperative programming style instead of NumPy arrays or list comprehensions intentionally, to make the code more readable for Python beginners.
We can then define our hyperparameters and call the adagrad() function to optimize our test objective function. In this case, we will use 50 iterations of the algorithm and an initial learning rate of 0.1, both chosen after a little trial and error.
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search with adagrad
best, score = adagrad(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))
Program 23.13: Perform gradient descent search with AdaGrad
Tying all of this together, the complete example of gradient descent optimization with adaptive
gradients is listed below.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
Running the example applies the AdaGrad optimization algorithm to our test problem and
reports the performance of the search for each iteration of the algorithm.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.
We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search
solutions = adagrad(objective, derivative, bounds, n_iter, step_size)
Program 23.16: Perform the gradient descent search
...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
Program 23.17: Create contour plot of the objective function
Finally, we can plot each solution found during the search as a white dot connected by a line.
...
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
Program 23.18: Plot the samples as white dots
Tying this all together, the complete example of performing the AdaGrad optimization on the
test problem and plotting the results on a contour plot is listed below.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
...
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the samples as white dots
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()
Program 23.19: Example of plotting the AdaGrad search on a contour plot of the
test function
Running the example performs the search as before, except in this case, a contour plot of the
objective function is created and a white dot is shown for each solution found during the search,
starting above the optima and progressively getting closer to the optima at the center of the
plot.
Figure 23.3: Contour Plot of the Test Objective Function with AdaGrad Search Results
Shown
Papers
John Duchi, Elad Hazan, and Yoram Singer. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization". Journal of Machine Learning Research, 12(61), 2011, pp. 2121–2159.
https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
APIs
numpy.random.rand API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
numpy.asarray API.
https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
Matplotlib API.
https://matplotlib.org/api/pyplot_api.html
Articles
Gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Gradient_descent
Stochastic gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
An overview of gradient descent optimization algorithms. 2016.
https://ruder.io/optimizing-gradient-descent/index.html
23.6 Summary
In this tutorial, you discovered how to develop the gradient descent with adaptive gradients optimization algorithm from scratch. Specifically, you learned:
▷ Gradient descent is an optimization algorithm that uses the gradient of the objective
function to navigate the search space.
▷ Gradient descent can be updated to use an automatically adaptive step size for each
input variable in the objective function, called adaptive gradients or AdaGrad.
▷ How to implement the AdaGrad optimization algorithm from scratch and apply it to
an objective function and evaluate the results.
Next, we will learn about another variation of gradient descent that is very similar to
AdaGrad.
Gradient Descent with RMSProp
24
Gradient descent is an optimization algorithm that follows the negative gradient of an objective
function in order to locate the minimum of the function. A limitation of gradient descent is
that it uses the same step size (learning rate) for each input variable. Adaptive Gradients, or AdaGrad for short, is an extension of the gradient descent optimization algorithm that allows the step size in each dimension used by the optimization algorithm to be automatically adapted based on the gradients (partial derivatives) seen for that variable over the course of the search. A limitation of AdaGrad is that it can result in a very small step size for each parameter by the end of the search, which can slow the progress of the search down too much and may prevent the optima from being located. Root Mean Squared Propagation, or RMSProp, is an extension of gradient descent and
optima. Root Mean Squared Propagation, or RMSProp, is an extension of gradient descent and
the AdaGrad version of gradient descent that uses a decaying average of partial gradients in the
adaptation of the step size for each parameter. The use of a decaying moving average allows the
algorithm to forget early gradients and focus on the most recently observed partial gradients
seen during the progress of the search, overcoming the limitation of AdaGrad.
In this tutorial, you will discover how to develop the gradient descent with RMSProp
optimization algorithm from scratch. After completing this tutorial, you will know:
▷ Gradient descent is an optimization algorithm that uses the gradient of the objective
function to navigate the search space.
▷ Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying moving average of partial derivatives, called RMSProp.
▷ How to implement the RMSProp optimization algorithm from scratch and apply it to
an objective function and evaluate the results.
Let's get started.
“... minimum ...” (Page 69, Algorithms for Optimization, 2019)
The first-order derivative, or simply the "derivative," is the rate of change or slope of the target function at a specific point, e.g. for a specific input. If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.
▷ Gradient: First-order derivative for a multivariate objective function.
The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input. Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function. The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f′() gives the derivative of the target function for a given set of inputs. The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space. The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function. A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called α or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

x_new = x − α × f′(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.
▷ Step Size (α): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.
If the step size is too small, the movement in the search space will be small and the search will
take a long time. If the step size is too large, the search may bounce around the search space
and skip over the optima.
Now that we are familiar with the gradient descent optimization algorithm, let's take a look at RMSProp.
“... gradient and may have made the learning rate too small before arriving at such a convex structure.” (Pages 307–308, Deep Learning, 2016)
A problem with AdaGrad is that it can slow the search down too much, resulting in very small learning rates for each parameter or dimension of the search by the end of the run. This has the effect of stopping the search too soon, before the minima can be located.
“RMSProp extends Adagrad to avoid the effect of a monotonically decreasing learning rate.” (Page 78, Algorithms for Optimization, 2019)
“... past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.” (Page 308, Deep Learning, 2016)
The calculation of the mean squared partial derivative for one parameter is as follows:

s(t + 1) = s(t) × ρ + f′(x(t))² × (1 − ρ)
Where s(t + 1) is the decaying moving average of the squared partial derivative for one parameter for the current iteration of the algorithm, s(t) is the decaying moving average of the squared partial derivative for the previous iteration, f′(x(t))² is the squared partial derivative for the current parameter, and ρ is a hyperparameter, typically with a value of 0.9, like momentum. The technique takes its name from the fact that we keep a decaying average of the squared partial derivatives and take the square root of this average, i.e., the square root of the mean squared partial derivatives, or root mean square (RMS). For example, with the initial step size η, the custom step size α for a parameter may be written as:

α(t + 1) = η / (10⁻⁸ + √(s(t + 1)))
Once we have the custom step size for the parameter, we can update the parameter using the
custom step size and the partial derivative f ′ (x(t)).
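That is, restating the update in the same notation (this line is implied by the description rather than printed here):

x(t + 1) = x(t) − α(t + 1) × f′(x(t))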
This process is then repeated for each input variable until a new point in the search space is
created and can be evaluated. RMSProp is a very effective extension of gradient descent and is
one of the preferred approaches generally used to Ąt deep learning neural networks.
“Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks. It is currently one of the go-to optimization methods being employed routinely by deep learning practitioners.” (Page 308, Deep Learning, 2016)
Now that we are familiar with the RMSProp algorithm, let's explore how we might implement it and evaluate its performance.
We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the
response surface. The complete example of plotting the objective function is listed below.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
Running the example creates a three-dimensional surface plot of the objective function. We can
see the familiar bowl shape with the global minima at f (0, 0) = 0.
We can also create a two-dimensional plot of the function. This will be helpful later when
we want to plot the progress of the search. The example below creates a contour plot of the
objective function.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
Running the example creates a two-dimensional contour plot of the objective function. We can
see the bowl shape compressed to contours shown with a color gradient. We will use this plot
to plot the specific points explored during the progress of the search.
Now that we have a test objective function, let's look at how we might implement the RMSProp optimization algorithm.
f(x) = x²
f′(x) = 2 × x
The derivative of x² is 2x in each dimension. The derivative() function implements this below.
Next, we can implement gradient descent optimization. First, we can select a random point in the bounds of the problem as a starting point for the search. This assumes we have an array that defines the bounds of the search with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of that dimension.
...
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
Program 24.5: Generate an initial point
Next, we need to initialize the decay average of the squared partial derivatives for each dimension
to 0.0 values.
...
sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
Program 24.6: List of the average square gradients for each variable
We can then enumerate a fixed number of iterations of the search optimization algorithm defined by an "n_iter" hyperparameter.
...
for it in range(n_iter):
...
Program 24.7: Run the gradient descent
The first step is to calculate the gradient for the current solution using the derivative() function.
...
gradient = derivative(solution[0], solution[1])
Program 24.8: Calculate gradient
We then need to calculate the square of the partial derivative and update the decaying average of the squared partial derivatives with the "rho" hyperparameter.
...
for i in range(gradient.shape[0]):
    # calculate the squared gradient
    sg = gradient[i]**2.0
    # update the moving average of the squared gradient
    sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
Program 24.9: Update the average of the squared partial derivatives
We can then use the moving average of the squared partial derivatives and the gradient to calculate the step size for the next point. We will do this one variable at a time, first calculating the step size for the variable, then the new value for the variable. These values are built up in an array until we have a completely new solution that is in the steepest descent direction from the current point using the custom step sizes.
...
new_solution = list()
for i in range(solution.shape[0]):
    # calculate the step size for this variable
    alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))
    # calculate the new position in this variable
    value = solution[i] - alpha * gradient[i]
    # store this variable
    new_solution.append(value)
Program 24.10: Build a solution one variable at a time
This new solution can then be evaluated using the objective() function and the performance
of the search can be reported.
...
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
Program 24.11: Evaluate candidate point
Note: We are using simple Python lists and an imperative programming style instead of NumPy arrays or list comprehensions intentionally, to make the code more readable for Python beginners.
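Tying the fragments above together, a sketch of the full rmsprop() function might look like the following (assembled from Programs 24.5 to 24.11; it assumes sqrt, asarray, and rand are imported as in the other examples, and the book's exact listing may differ in minor details):

# gradient descent algorithm with rmsprop (sketch)
from math import sqrt
from numpy import asarray
from numpy.random import rand

def rmsprop(objective, derivative, bounds, n_iter, step_size, rho):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the decaying moving average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sg = gradient[i]**2.0
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0 - rho))
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))
            # calculate the new position in this variable
            new_solution.append(solution[i] - alpha * gradient[i])
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]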
We can then define our hyperparameters and call the rmsprop() function to optimize our test objective function. In this case, we will use 50 iterations of the algorithm, an initial learning rate of 0.01, and a value of 0.99 for the rho hyperparameter, all chosen after a little trial and error.
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.01
# momentum for rmsprop
rho = 0.99
# perform the gradient descent search with rmsprop
best, score = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)
print('Done!')
print('f(%s) = %f' % (best, score))
Program 24.13: Perform gradient descent search with RMSProp
Tying all of this together, the complete example of gradient descent optimization with RMSProp
is listed below.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
Running the example applies the RMSProp optimization algorithm to our test problem and
reports the performance of the search for each iteration of the algorithm.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that a near-optimal solution was found after perhaps 33 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.
...
>30 f([-9.61030898e-14 3.19352553e-03]) = 0.00001
>31 f([-3.42767893e-14 2.71513758e-03]) = 0.00001
>32 f([-1.21143047e-14 2.30636623e-03]) = 0.00001
>33 f([-4.24204875e-15 1.95738936e-03]) = 0.00000
>34 f([-1.47154482e-15 1.65972553e-03]) = 0.00000
...
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions
Program 24.14: Gradient descent algorithm with RMSProp
We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.01
# momentum for rmsprop
rho = 0.99
# perform the gradient descent search with rmsprop
solutions = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)
Program 24.15: Perform the gradient descent search with RMSProp
...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
Program 24.16: Contour plot of the objective function
Finally, we can plot each solution found during the search as a white dot connected by a line.
...
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
Program 24.17: Plot the samples as white dots
Tying this all together, the complete example of performing the RMSProp optimization on the
test problem and plotting the results on a contour plot is listed below.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
...
    return solutions
Running the example performs the search as before, except in this case, the contour plot of
the objective function is created. In this case, we can see that a white dot is shown for each
solution found during the search, starting above the optima and progressively getting closer to
the optima at the center of the plot.
Figure 24.3: Contour plot of the test objective function with RMSProp search results
shown
Papers
Geoffrey Hinton. "Lecture 6e, rmsprop: Divide the gradient by a running average of its recent magnitude". Lecture Notes, Neural Networks for Machine Learning.
http://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
APIs
numpy.random.rand API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
numpy.asarray API.
https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
Matplotlib API.
https://matplotlib.org/api/pyplot_api.html
Articles
Gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Gradient_descent
Stochastic gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
An overview of gradient descent optimization algorithms. 2016.
https://ruder.io/optimizing-gradient-descent/index.html
24.6 Summary
In this tutorial, you discovered how to develop the gradient descent with RMSProp optimization algorithm from scratch. Specifically, you learned:
▷ Gradient descent is an optimization algorithm that uses the gradient of the objective
function to navigate the search space.
▷ Gradient descent can be updated to use an automatically adaptive step size for each
input variable using a decaying average of partial derivatives, called RMSProp.
▷ How to implement the RMSProp optimization algorithm from scratch and apply it to
an objective function and evaluate the results.
Next, we will learn about an algorithm that further extends AdaGrad and RMSProp.
Gradient Descent with Adadelta
25
Gradient descent is an optimization algorithm that follows the negative gradient of an objective
function in order to locate the minimum of the function. A limitation of gradient descent is
that it uses the same step size (learning rate) for each input variable. AdaGrad and RMSProp
are extensions to gradient descent that add a self-adaptive learning rate for each parameter of the objective function. Adadelta can be considered a further extension of gradient descent that builds upon AdaGrad and RMSProp and changes the calculation of the custom step size so that the units are consistent and, in turn, an initial learning rate hyperparameter is no longer required.
In this tutorial, you will discover how to develop the gradient descent with Adadelta
optimization algorithm from scratch. After completing this tutorial, you will know:
▷ Gradient descent is an optimization algorithm that uses the gradient of the objective
function to navigate the search space.
▷ Gradient descent can be updated to use an automatically adaptive step size for each
input variable using a decaying average of partial derivatives, called Adadelta.
▷ How to implement the Adadelta optimization algorithm from scratch and apply it to an
objective function and evaluate the results.
Let's get started.
“... minimum ...” (Page 69, Algorithms for Optimization, 2019)
The first-order derivative, or simply the "derivative," is the rate of change or slope of the target function at a specific point, e.g. for a specific input. If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.
▷ Gradient: First-order derivative for a multivariate objective function.
The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input. Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function. The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f′() gives the derivative of the target function for a given set of inputs. The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space. The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function. A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called α or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

x_new = x − α × f′(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.
▷ Step Size (α): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.
If the step size is too small, the movement in the search space will be small and the search will
take a long time. If the step size is too large, the search may bounce around the search space
and skip over the optima.
Now that we are familiar with the gradient descent optimization algorithm, let's take a look at Adadelta.
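As with RMSProp, Adadelta first maintains a decaying moving average of the squared partial derivative for each parameter; restating that update here for reference (the equation is introduced in the previous chapter):

s(t + 1) = s(t) × ρ + f′(x(t))² × (1 − ρ)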
Where s(t + 1) is the mean squared partial derivative for one parameter for the current iteration
of the algorithm, s(t) is the decaying moving average squared partial derivative for the previous
iteration, f ′ (x(t))2 is the squared partial derivative for the current parameter, and rho is a
hyperparameter, typically with the value of 0.9 like momentum.
Adadelta is a further extension of RMSProp designed to improve the convergence of the algorithm and to remove the need for a manually specified initial learning rate.
“The idea presented in this paper was derived from ADAGRAD in order to improve upon the two main drawbacks of the method: 1) the continual decay of learning rates throughout training, and 2) the need for a manually selected global learning rate.” ("ADADELTA: An Adaptive Learning Rate Method", 2012)
The decaying moving average of the squared partial derivative is calculated for each
parameter, as with RMSProp. The key difference is in the calculation of the step size for a
parameter that uses the decaying average of the delta or change in parameter. This choice of
numerator was to ensure that both parts of the calculation have the same units.
“After independently deriving the RMSProp update, the authors noticed that the units in the update equations for gradient descent, momentum and Adagrad do not match. To fix this, they use an exponentially decaying average of the square updates.” (Pages 78–79, Algorithms for Optimization, 2019)
First, the custom step size is calculated as the square root of the decaying moving average of the squared change to the parameter (the delta) divided by the square root of the decaying moving average of the squared partial derivatives:

α(t + 1) = √(ϵ + δ(t)) / √(ϵ + s(t))
Where α(t + 1) is the custom step size for a parameter for a given update, ϵ is a hyperparameter
that is added to the numerator and denominator to avoid a divide by zero error, δ(t) is the
decaying moving average of the squared change to the parameter (calculated in the last iteration),
and s(t) is the decaying moving average of the squared partial derivative (calculated in the
current iteration).
The ϵ hyperparameter is set to a small value such as 10⁻³ or 10⁻⁸. In addition to avoiding a divide-by-zero error, it also helps with the first step of the algorithm, when the decaying moving average of the squared change and the decaying moving average of the squared gradient are both zero. Next, the change to the parameter is calculated as the custom step size multiplied by the partial derivative:

∆x(t + 1) = α(t + 1) × f′(x(t))

Next, the decaying average of the squared change to the parameter is updated:

δ(t + 1) = δ(t) × ρ + ∆x(t + 1)² × (1 − ρ)

Where δ(t + 1) is the decaying average of the squared change to the variable to be used in the next iteration, ∆x(t + 1) was calculated in the step before, and ρ is a hyperparameter that acts like momentum and has a value like 0.9. Finally, the new value for the variable is calculated using the change.

x(t + 1) = x(t) − ∆x(t + 1)

This process is then repeated for each variable of the objective function, then the entire process is repeated to navigate the search space for a fixed number of algorithm iterations. Now that we are familiar with the Adadelta algorithm, let's explore how we might implement it and evaluate its performance.
We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the
response surface. The complete example of plotting the objective function is listed below.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
Running the example creates a three dimensional surface plot of the objective function. We can
see the familiar bowl shape with the global minima at f (0, 0) = 0.
We can also create a two-dimensional plot of the function. This will be helpful later when
we want to plot the progress of the search. The example below creates a contour plot of the
objective function.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
Running the example creates a two-dimensional contour plot of the objective function. We can
see the bowl shape compressed to contours shown with a color gradient. We will use this plot
to plot the specific points explored during the progress of the search.
Now that we have a test objective function, let's look at how we might implement the Adadelta optimization algorithm.
f(x) = x²
f′(x) = 2 × x
The derivative of x² is 2x in each dimension. The derivative() function implements this below.
Next, we can implement gradient descent optimization. First, we can select a random point in the bounds of the problem as a starting point for the search. This assumes we have an array that defines the bounds of the search with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of that dimension.
...
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
Program 25.5: Generate an initial point
Next, we need to initialize the decaying average of the squared partial derivatives and squared
change for each dimension to 0.0 values.
...
# list of the average square gradients for each variable
sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
# list of the average parameter updates
sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
Program 25.6: List of average square gradient and average parameter updates
We can then enumerate a fixed number of iterations of the search optimization algorithm defined by an "n_iter" hyperparameter.
...
for it in range(n_iter):
...
Program 25.7: Run the gradient descent
The first step is to calculate the gradient for the current solution using the derivative() function.
...
gradient = derivative(solution[0], solution[1])
Program 25.8: Calculate gradient
We then need to calculate the square of the partial derivative and update the decaying moving average of the squared partial derivatives with the "rho" hyperparameter.
...
for i in range(gradient.shape[0]):
    # calculate the squared gradient
    sg = gradient[i]**2.0
    # update the moving average of the squared gradient
    sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))
Program 25.9: Update the average of the squared partial derivatives
We can then use the decaying moving average of the squared partial derivatives and gradient to
calculate the step size for the next point. We will do this one variable at a time.
...
new_solution = list()
for i in range(solution.shape[0]):
...
Program 25.10: Build solution
First, we will calculate the custom step size for this variable on this iteration using the decaying moving average of the squared changes and squared partial derivatives, as well as the "ep" hyperparameter.
...
alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
Program 25.11: Calculate the step size for this variable
Next, we can use the custom step size and partial derivative to calculate the change to the
variable.
...
change = alpha * gradient[i]
Program 25.12: Calculate the change
We can then use the change to update the decaying moving average of the squared change using the "rho" hyperparameter.
...
sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0-rho))
Program 25.13: Update the moving average of squared parameter changes
Finally, we can change the variable and store the result before moving on to the next variable.
...
# calculate the new position in this variable
value = solution[i] - change
# store this variable
new_solution.append(value)
Program 25.14: Calculate the new position
This new solution can then be evaluated using the objective() function and the performance
of the search can be reported.
...
# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
Program 25.15: Evaluate candidate point and report progress
Note: We are using simple Python lists and an imperative programming style instead of NumPy arrays or list comprehensions intentionally, to make the code more readable for Python beginners.
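Tying the fragments above together, a sketch of the full adadelta() function might look like the following (assembled from Programs 25.5 to 25.15; the ep default of 1e-3 and the imports are assumptions consistent with the surrounding text, and the book's exact listing may differ in minor details):

# gradient descent algorithm with adadelta (sketch)
from math import sqrt
from numpy import asarray
from numpy.random import rand

def adadelta(objective, derivative, bounds, n_iter, rho, ep=1e-3):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the average square gradients for each variable
    sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]
    # list of the average parameter updates
    sq_para_avg = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the decaying moving average of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sg = gradient[i]**2.0
            sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0 - rho))
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = (ep + sqrt(sq_para_avg[i])) / (ep + sqrt(sq_grad_avg[i]))
            # calculate the change for this variable
            change = alpha * gradient[i]
            # update the decaying moving average of the squared change
            sq_para_avg[i] = (sq_para_avg[i] * rho) + (change**2.0 * (1.0 - rho))
            # calculate the new position in this variable
            new_solution.append(solution[i] - change)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]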
We can then define our hyperparameters and call the adadelta() function to optimize our test objective function. In this case, we will use 120 iterations of the algorithm and a value of 0.99 for the rho hyperparameter, chosen after a little trial and error.
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# rho for adadelta
rho = 0.99
Tying all of this together, the complete example of gradient descent optimization with Adadelta
is listed below.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
Running the example applies the Adadelta optimization algorithm to our test problem and
reports performance of the search for each iteration of the algorithm.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that a near-optimal solution was found after perhaps 105 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.
...
>100 f([-1.45142626e-07 2.71163181e-03]) = 0.00001
>101 f([-1.24898699e-07 2.56875692e-03]) = 0.00001
>102 f([-1.07454197e-07 2.43328237e-03]) = 0.00001
>103 f([-9.24253035e-08 2.30483111e-03]) = 0.00001
>104 f([-7.94803792e-08 2.18304501e-03]) = 0.00000
>105 f([-6.83329263e-08 2.06758392e-03]) = 0.00000
>106 f([-5.87354975e-08 1.95812477e-03]) = 0.00000
>107 f([-5.04744185e-08 1.85436071e-03]) = 0.00000
>108 f([-4.33652179e-08 1.75600036e-03]) = 0.00000
>109 f([-3.72486699e-08 1.66276699e-03]) = 0.00000
>110 f([-3.19873691e-08 1.57439783e-03]) = 0.00000
>111 f([-2.74627662e-08 1.49064334e-03]) = 0.00000
>112 f([-2.3572602e-08 1.4112666e-03]) = 0.00000
We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 120
# rho for adadelta
rho = 0.99
# perform the gradient descent search with adadelta
solutions = adadelta(objective, derivative, bounds, n_iter, rho)
Program 25.20: Perform gradient descent search with Adadelta
...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
Program 25.21: Create contour plot of the objective function
Finally, we can plot each solution found during the search as a white dot connected by a line.
...
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
Program 25.22: Plot the samples as white dots
Tying this all together, the complete example of performing the Adadelta optimization on the
test problem and plotting the results on a contour plot is listed below.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
Running the example performs the search as before, except in this case, the contour plot of
the objective function is created. In this case, we can see that a white dot is shown for each
solution found during the search, starting above the optima and progressively getting closer to
the optima at the center of the plot.
Figure 25.3: Contour Plot of the Test Objective Function With Adadelta Search Results
Shown
Papers
Matthew D. Zeiler. "ADADELTA: An Adaptive Learning Rate Method". arXiv 1212.5701, 2012.
https://arxiv.org/abs/1212.5701
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
APIs
numpy.random.rand API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
numpy.asarray API.
https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
Matplotlib API.
https://matplotlib.org/api/pyplot_api.html
Articles
Gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Gradient_descent
Stochastic gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
An overview of gradient descent optimization algorithms. 2016.
https://ruder.io/optimizing-gradient-descent/index.html
25.6 Summary
In this tutorial, you discovered how to develop the gradient descent with Adadelta optimization algorithm from scratch. Specifically, you learned:
▷ Gradient descent is an optimization algorithm that uses the gradient of the objective
function to navigate the search space.
▷ Gradient descent can be updated to use an automatically adaptive step size for each
input variable using a decaying average of partial derivatives, called Adadelta.
▷ How to implement the Adadelta optimization algorithm from scratch and apply it to an
objective function and evaluate the results.
Next, you will learn about Adam, another variant of gradient descent.
Adam Optimization Algorithm
26
Gradient descent is an optimization algorithm that follows the negative gradient of an objective
function in order to locate the minimum of the function. A limitation of gradient descent is that
a single step size (learning rate) is used for all input variables. Extensions to gradient descent
like AdaGrad and RMSProp update the algorithm to use a separate step size for each input
variable but may result in a step size that rapidly decreases to very small values. The Adaptive Moment Estimation algorithm, or Adam for short, is an extension to gradient descent and a natural successor to techniques like AdaGrad and RMSProp. It automatically adapts a learning rate for each input variable of the objective function and further smooths the search process by using an exponentially decaying moving average of the gradient to make updates to variables.
In this tutorial, you will discover how to develop the gradient descent with Adam optimization algorithm from scratch. After completing this tutorial, you will know:
▷ Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
▷ Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives (first and second moments), called Adam.
▷ How to implement the Adam optimization algorithm from scratch and apply it to an
objective function and evaluate the results.
Let's get started.
The first-order derivative², or simply the "derivative," is the rate of change or slope of the target function at a specific point, e.g. for a specific input. If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient³.
▷ Gradient: First-order derivative for a multivariate objective function.
The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input. Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.
The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f′() gives the derivative of the target function for a given set of inputs. The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space. The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function. A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called α or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.
x_new = x − α × f′(x)
The steeper the objective function at a given point, the larger the magnitude of the gradient
and, in turn, the larger the step taken in the search space. The size of the step taken is scaled
using a step size hyperparameter.
▷ Step Size (α): Hyperparameter that controls how far to move in the search space against
the gradient each iteration of the algorithm.
If the step size is too small, the movement in the search space will be small and the search will
take a long time. If the step size is too large, the search may bounce around the search space
and skip over the optima.
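To make the update rule concrete, the short sketch below applies it to a simple one-dimensional function; the objective() and derivative() functions here are illustrative assumptions, not listings from this book.

# a minimal sketch of the update x_new = x - alpha * f'(x), assuming f(x) = x^2 so f'(x) = 2x
def objective(x):
    return x ** 2.0

def derivative(x):
    return 2.0 * x

x = 1.0        # starting point
alpha = 0.1    # step size (learning rate)
for t in range(20):
    # move against the gradient
    x = x - alpha * derivative(x)
    print('>%d x=%.5f f(x)=%.5f' % (t, x, objective(x)))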
1. https://en.wikipedia.org/wiki/Gradient_descent
2. https://en.wikipedia.org/wiki/Derivative
3. https://en.wikipedia.org/wiki/Gradient
Now that we are familiar with the gradient descent optimization algorithm, let's take a look at the Adam algorithm. The algorithm maintains a first moment m and a second moment ν for each parameter being optimized, both initialized to zero:
m(0) = 0
ν(0) = 0
The algorithm is executed iteratively over time t starting at t = 1, and each iteration involves calculating a new set of parameter values x, i.e. going from x(t − 1) to x(t). It is perhaps easier to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations. First, the gradient (partial derivatives) is calculated for the current time step:
g(t) = f′(x(t − 1))
Next, the first moment is updated using the gradient and a hyperparameter β1:
m(t) = β1 × m(t − 1) + (1 − β1) × g(t)
Then the second moment is updated using the squared gradient and a hyperparameter β2:
ν(t) = β2 × ν(t − 1) + (1 − β2) × g(t)²
The first and second moments are biased because they are initialized with zero values.

". . . these moving averages are initialized as (vectors of) 0's, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the betas are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates . . ."
-- "Adam: A Method for Stochastic Optimization", 2015.

Next, the first and second moments are bias-corrected, starting with the first moment:
m̂(t) = m(t) / (1 − β1(t))
And then the second moment:
ν̂(t) = ν(t) / (1 − β2(t))
Note, β1(t) and β2(t) refer to the β1 and β2 hyperparameters that are decayed on a schedule over the iterations of the algorithm. A static decay schedule can be used, although the paper recommends the following:
β1(t) = β1^t
β2(t) = β2^t
Finally, we can calculate the value for the parameter for this iteration:
x(t) = x(t − 1) − α × m̂(t) / (√ν̂(t) + ϵ)
Where α is the step size hyperparameter and ϵ is a small value such as 10⁻⁸ that ensures we do not encounter a divide-by-zero error. Note, a more efficient reordering of the update rule listed in the paper can be used:
α(t) = α × √(1 − β2(t)) / (1 − β1(t))
x(t) = x(t − 1) − α(t) × m(t) / (√ν(t) + ϵ)
To review, there are three hyperparameters for the algorithm:
▷ α: Initial step size (learning rate), a typical value is 0.001.
▷ β1: Decay factor for the first moment, a typical value is 0.9.
▷ β2: Decay factor for the second moment, a typical value is 0.999.
And that's it. For the full derivation of the Adam algorithm, I recommend reading the paper:
▷ Diederik P. Kingma and Jimmy Lei Ba, "Adam: A Method for Stochastic Optimization," in Proc. 3rd Int. Conf. Learning Representations (ICLR), 2015.⁵
Next, letŠs look at how we might implement the algorithm from scratch in Python.
26.4 Gradient Descent With Adam
We can create a three-dimensional plot of the objective function to get a feeling for the curvature of the response surface. The complete example of plotting the objective function is listed below.
# objective function
def objective(x, y):
return x**2.0 + y**2.0
5. https://arxiv.org/abs/1412.6980
Running the example creates a three-dimensional surface plot of the objective function. We can
see the familiar bowl shape with the global minima at f (0, 0) = 0.
We can also create a two-dimensional plot of the function. This will be helpful later when
we want to plot the progress of the search. The example below creates a contour plot of the
objective function.
# objective function
def objective(x, y):
return x**2.0 + y**2.0
Running the example creates a two-dimensional contour plot of the objective function. We can
see the bowl shape compressed to contours shown with a color gradient. We will use this plot
to plot the specific points explored during the progress of the search.
Now that we have a test objective function, let's look at how we might implement the Adam optimization algorithm.
Next, we can implement gradient descent optimization. First, we can select a random point in
the bounds of the problem as a starting point for the search. This assumes we have an array
that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.
...
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])
Program 26.5: Evaluate the objective function at a random point
...
m = [0.0 for _ in range(bounds.shape[0])]
v = [0.0 for _ in range(bounds.shape[0])]
Program 26.6: Initialize first and second moments
We then run a fixed number of iterations of the algorithm defined by the "n_iter" hyperparameter.
...
for t in range(n_iter):
...
Program 26.7: Run iterations of gradient descent
The first step is to calculate the gradient for the current set of parameters using the derivative() function.
...
g = derivative(x[0], x[1])
Program 26.9: Calculate gradient
Next, we need to perform the Adam update calculations. We will perform these calculations
one variable at a time using an imperative programming style for readability. In practice, I
recommend using NumPy vector operations for efficiency.
...
for i in range(x.shape[0]):
...
Program 26.10: Build a solution one variable at a time
...
# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
Program 26.11: Calculate the first moment
...
# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
Program 26.12: Calculate the second moment
Then the bias correction for the first and second moments.
...
# mhat(t) = m(t) / (1 - beta1(t))
mhat = m[i] / (1.0 - beta1**(t+1))
# vhat(t) = v(t) / (1 - beta2(t))
vhat = v[i] / (1.0 - beta2**(t+1))
Program 26.13: Calculate the bias correction for the first and second moments
...
# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
Program 26.14: Update the variable
This is then repeated for each parameter that is being optimized. At the end of the iteration we
can evaluate the new parameter values and report the performance of the search.
...
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
Program 26.15: Evaluate candidate point and report progress
We can tie all of this together into a function named adam() that takes the names of the objective
and derivative functions as well as the algorithm hyperparameters, and returns the best solution
found at the end of the search and its evaluation. This complete function is listed below.
Note: We are using simple Python lists and imperative programming style instead of NumPy arrays or list comprehensions intentionally to make the code more readable for Python beginners.
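The complete adam() listing is not reproduced at this point. A sketch that is consistent with the preceding steps might look like the following; the eps default of 1e-8 and the exact progress-reporting format are assumptions.

from math import sqrt
from numpy.random import rand

def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
    # generate an initial point within the bounds
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize first and second moments
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    # run iterations of gradient descent
    for t in range(n_iter):
        # calculate the gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(x.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
            # bias correction of the moments
            mhat = m[i] / (1.0 - beta1**(t+1))
            vhat = v[i] / (1.0 - beta2**(t+1))
            # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
        # evaluate candidate point and report progress
        score = objective(x[0], x[1])
        print('>%d f(%s) = %.5f' % (t, x, score))
    # return the best solution found at the end of the search and its evaluation
    return [x, score]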
We can then define our hyperparameters and call the adam() function to optimize our test objective function. In this case, we will use 60 iterations of the algorithm with an initial step size of 0.02 and beta1 and beta2 values of 0.8 and 0.999 respectively. These hyperparameter values were found after a little trial and error.
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))
Program 26.17: Perform gradient descent search with Adam
Tying all of this together, the complete example of gradient descent optimization with Adam is
listed below.
# objective function
def objective(x, y):
return x**2.0 + y**2.0
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))
Program 26.18: Gradient descent optimization with Adam for a two-dimensional test
function
Running the example applies the Adam optimization algorithm to our test problem and reports
the performance of the search for each iteration of the algorithm.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that a near-optimal solution was found after perhaps 53 iterations
of the search, with input values near 0.0 and 0.0, evaluating to 0.0.
...
>50 f([-0.00056912 -0.00321961]) = 0.00001
>51 f([-0.00052452 -0.00286514]) = 0.00001
>52 f([-0.00043908 -0.00251304]) = 0.00001
>53 f([-0.0003283 -0.00217044]) = 0.00000
>54 f([-0.00020731 -0.00184302]) = 0.00000
>55 f([-8.95352320e-05 -1.53514076e-03]) = 0.00000
>56 f([ 1.43050285e-05 -1.25002847e-03]) = 0.00000
>57 f([ 9.67123406e-05 -9.89850279e-04]) = 0.00000
>58 f([ 0.00015359 -0.00075587]) = 0.00000
>59 f([ 0.00018407 -0.00054858]) = 0.00000
Done!
f([ 0.00018407 -0.00054858]) = 0.000000
Output 26.1: Result from Program 26.18
We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
solutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
Program 26.20: Perform the gradient descent search with Adam
...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
Finally, we can plot each solution found during the search as a white dot connected by a line.
...
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
Program 26.22: Plot the solutions as white dots
Tying this all together, the complete example of performing the Adam optimization on the test
problem and plotting the results on a contour plot is listed below.
# objective function
def objective(x, y):
return x**2.0 + y**2.0
Running the example performs the search as before, except in this case, a contour plot of the
objective function is created. In this case, we can see that a white dot is shown for each solution
found during the search, starting above the optima and progressively getting closer to the
optima at the center of the plot.
Figure 26.3: Contour Plot of the Test Objective Function With Adam Search Results Shown
26.5 Further Reading
Papers
Diederik P. Kingma and Jimmy Lei Ba. "Adam: A Method for Stochastic Optimization". In: Proc. 3rd Int. Conf. Learning Representations (ICLR). 2015.
https://arxiv.org/abs/1412.6980
Books
Mykel J. Kochenderfer and Tim A. Wheeler. Algorithms for Optimization. MIT Press, 2019.
https://amzn.to/3je8O1J
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
https://amzn.to/3qSk3C2
APIs
numpy.random.rand API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
numpy.asarray API.
https://numpy.org/doc/stable/reference/generated/numpy.asarray.html
Matplotlib API.
https://matplotlib.org/api/pyplot_api.html
Articles
Gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Gradient_descent
Stochastic gradient descent. Wikipedia.
https://en.wikipedia.org/wiki/Stochastic_gradient_descent
An overview of gradient descent optimization algorithms. 2016.
https://ruder.io/optimizing-gradient-descent/index.html
26.6 Summary
In this tutorial, you discovered how to develop the gradient descent with Adam optimization algorithm from scratch. Specifically, you learned:
▷ Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
▷ Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives (first and second moments), called Adam.
▷ How to implement the Adam optimization algorithm from scratch and apply it to an objective function and evaluate the results.
Here we finish with all the gradient descent algorithms. Next, we will see how the different optimization algorithms can be used.
VI
Projects
Use Optimization Algorithms to
Manually Fit Regression Models
27
Regression models are fit on training data using least squares and local search optimization algorithms. Models like linear regression and logistic regression are trained by least squares optimization, and this is the most efficient approach to finding coefficients that minimize error for these models. Nevertheless, it is possible to use alternate optimization algorithms to fit a regression model to a training dataset. This can be a useful exercise to learn more about how regression functions and the central nature of optimization in applied machine learning. It may also be required for regression with data that does not meet the requirements of a least squares optimization procedure.
In this tutorial, you will discover how to manually optimize the coefficients of regression
models. After completing this tutorial, you will know:
▷ How to develop the inference models for regression from scratch.
▷ How to optimize the coefficients of a linear regression model for predicting numeric
values.
▷ How to optimize the coefficients of a logistic regression model using stochastic hill
climbing.
Let's get started.
weighted sum of the inputs. Linear regression¹ is designed for "regression" problems that require a number to be predicted, and logistic regression² is designed for "classification" problems that require a class label to be predicted. These regression models involve the use of an optimization algorithm to find a set of coefficients for each input to the model that minimizes the prediction error. Because the models are linear and well understood, efficient optimization algorithms can be used.
In the case of linear regression, the coefficients can be found by least squares optimization,
which can be solved using linear algebra. In the case of logistic regression, a local search
optimization algorithm is commonly used. It is possible to use any arbitrary optimization
algorithm to train linear and logistic regression models. That is, we can define a regression model and use a given optimization algorithm to find a set of coefficients for the model that result in a minimum of prediction error or a maximum of classification accuracy.
Using alternate optimization algorithms is expected to be less efficient on average than using the recommended optimization. Nevertheless, it may be more efficient in some specific cases, such as if the input data does not meet the expectations of the model, like having a Gaussian distribution or being uncorrelated with the other inputs. It can also be an interesting exercise to demonstrate the central nature of optimization in training machine learning algorithms, and specifically regression models.
Next, let's explore how to train a linear regression model using stochastic hill climbing.
Running the example prints the shape of the created dataset, confirming our expectations.
1. https://en.wikipedia.org/wiki/Linear_regression
2. https://en.wikipedia.org/wiki/Logistic_regression
3. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html
Next, we need to define a linear regression model. Before we optimize the model coefficients, we must develop the model and our confidence in how it works. Let's start by developing a function that calculates the activation of the model for a given input row of data from the dataset. This function will take the row of data and the coefficients for the model and calculate the weighted sum of the input with the addition of an extra y-intercept (also called the offset or bias) coefficient. The predict_row() function below implements this. We are using simple Python lists and imperative programming style instead of NumPy arrays or list comprehension intentionally to make the code more readable for Python beginners. Feel free to optimize it.
Next, we can call the predict_row() function for each row in a given dataset. The
predict_dataset() function below implements this. Again, we are intentionally using a
simple imperative coding style for readability instead of list comprehension.
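The listings for these two functions are not reproduced at this point; sketches consistent with the complete example later in this section might look like the following.

# linear regression: weighted sum of the inputs plus a bias (the last coefficient)
def predict_row(row, coefficients):
    # add the bias, the last coefficient
    result = coefficients[-1]
    # add the weighted input
    for i in range(len(row)):
        result += coefficients[i] * row[i]
    return result

# use the model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
    yhats = list()
    for row in X:
        # make a prediction for this row
        yhat = predict_row(row, coefficients)
        # store the prediction
        yhats.append(yhat)
    return yhats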
Finally, we can use the model to make predictions on our synthetic dataset to confirm it is all working correctly. We can generate a random set of model coefficients using the rand() function⁴. Recall that we need one coefficient for each input (ten inputs in this dataset) plus an extra weight for the y-intercept coefficient.
...
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)
Program 27.4: Define dataset and generate random model coefficients
We can then use these coefficients with the dataset to make predictions.
4. https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
...
yhat = predict_dataset(X, coefficients)
Program 27.5: Generate predictions for dataset
...
score = mean_squared_error(y, yhat)
print('MSE: %f' % score)
Program 27.6: Calculate model prediction error
That's it. We can tie all of this together and demonstrate our linear regression model for regression predictive modeling. The complete example is listed below.
# linear regression model evaluated with random coefficients
from numpy.random import rand
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# linear regression
def predict_row(row, coefficients):
    # add the bias, the last coefficient
    result = coefficients[-1]
    # add the weighted input
    for i in range(len(row)):
        result += coefficients[i] * row[i]
    return result

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
    yhats = list()
    for row in X:
        # make a prediction
        yhat = predict_row(row, coefficients)
        # store the prediction
        yhats.append(yhat)
    return yhats

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)
# calculate model prediction error
score = mean_squared_error(y, yhat)
print('MSE: %f' % score)
Program 27.7: Linear regression model
Running the example generates a prediction for each example in the training dataset, then
prints the mean squared error for the predictions.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
We would expect a large error given a set of random weights, and that is what we see in
this case, with an error value of about 7,307 units.
MSE: 7307.756740
Output 27.2: Result from Program 27.7
We can now optimize the coefficients of the model to achieve low error on this dataset. First,
we need to split the dataset into train and test sets. It is important to hold back some data not
used in optimizing the model so that we can prepare a reasonable estimate of the performance
of the model when used to make predictions on new data. We will use 67 percent of the data for
training and the remaining 33 percent as a test set for evaluating the performance of the model.
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
Program 27.8: Split data into train test sets
Next, we can develop a stochastic hill climbing algorithm. The optimization algorithm requires
an objective function to optimize. It must take a set of coefficients and return a score that is
to be minimized or maximized corresponding to a better model. In this case, we will evaluate
the mean squared error of the model with a given set of coefficients and return the error score,
which must be minimized. The objective() function below implements this, given the dataset
and a set of coefficients, and returns the error of the model.
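The objective() listing is elided here; a sketch, assuming the predict_dataset() function described above, might look like this.

from sklearn.metrics import mean_squared_error

# objective function: mean squared error of the model with the given coefficients
def objective(X, y, coefficients):
    # generate predictions for dataset
    yhat = predict_dataset(X, coefficients)
    # calculate the error, to be minimized
    score = mean_squared_error(y, yhat)
    return score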
Next, we can define the stochastic hill climbing algorithm. The algorithm will require an initial solution (e.g. random coefficients) and will iteratively keep making small changes to the solution and checking if it results in a better performing model. The amount of change made to the current solution is controlled by a step_size hyperparameter. This process will continue for a fixed number of iterations, also provided as a hyperparameter. The hillclimbing() function below implements this, taking the dataset, objective function, initial solution, and hyperparameters as arguments and returns the best set of coefficients found and the estimated performance.
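The hillclimbing() listing is elided here; a minimal sketch, assuming every coefficient is perturbed with Gaussian noise scaled by step_size at each iteration, might look like the following.

from numpy.random import randn

# stochastic hill climbing that minimizes the objective
def hillclimbing(X, y, objective, solution, n_iter, step_size):
    # evaluate the initial solution
    solution_eval = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step from the current solution
        candidate = solution + randn(len(solution)) * step_size
        # evaluate the candidate solution
        candidate_eval = objective(X, y, candidate)
        # keep the candidate if it is better (lower error)
        if candidate_eval <= solution_eval:
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d %.5f' % (i, solution_eval))
    return [solution, solution_eval]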
We can then call this function, passing in an initial set of coefficients as the initial solution and
the training dataset as the dataset to optimize the model against.
...
# define the total iterations
n_iter = 2000
# define the maximum step size
step_size = 0.15
# determine the number of coefficients
n_coef = X.shape[1] + 1
# define the initial solution
solution = rand(n_coef)
# perform the hill climbing search
coefficients, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
print('Done!')
print('Coefficients: %s' % coefficients)
print('Train MSE: %f' % (score))
Program 27.11: Perform hill climbing search
Finally, we can evaluate the best model on the test dataset and report the performance.
...
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# calculate accuracy
score = mean_squared_error(y_test, yhat)
print('Test MSE: %f' % (score))
Program 27.12: Evaluate the best model on test dataset
Tying this together, the complete example of optimizing the coefficients of a linear regression
model on the synthetic regression dataset is listed below.
# linear regression
def predict_row(row, coefficients):
# add the bias, the last coefficient
result = coefficients[-1]
# add the weighted input
for i in range(len(row)):
result += coefficients[i] * row[i]
return result
# objective function
def objective(X, y, coefficients):
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)
# calculate accuracy
score = mean_squared_error(y, yhat)
return score
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
Running the example will report the iteration number and mean squared error each time there
is an improvement made to the model. At the end of the search, the performance of the best
set of coefficients on the training dataset is reported and the performance of the same model on
the test dataset is calculated and reported.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that the optimization algorithm found a set of coefficients that
achieved an error of about 0.08 on both the train and test datasets. The fact that the algorithm
found a model with very similar performance on train and test datasets is a good sign, showing
that the model did not overfit (over-optimize) to the training dataset. This means the model
generalizes well to new data.
...
>1546 0.35426
>1567 0.32863
>1572 0.32322
>1619 0.24890
>1665 0.24800
>1691 0.24162
>1715 0.15893
>1809 0.15337
>1892 0.14656
>1956 0.08042
Done!
27.4 Optimize a Logistic Regression Model
Now that we are familiar with how to manually optimize the coefficients of a linear regression model, let's look at how we can extend the example to optimize the coefficients of a logistic regression model for classification.
Running the example prints the shape of the created dataset, confirming our expectations.
(1000, 5) (1000,)
Output 27.4: Result from Program 27.14
Next, we need to define a logistic regression model. Let's start by updating the predict_row() function to pass the weighted sum of the input and coefficients through a logistic function. The logistic function is defined as:
logistic = 1 / (1 + exp(−result))
5. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
Where result is the weighted sum of the inputs and the coefficients, and exp() is e (Euler's number⁶) raised to the power of the provided value, implemented via the exp() function⁷. The updated predict_row() function is listed below.
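The updated listing does not appear at this point; a sketch consistent with the complete example below might look like this.

from math import exp

# logistic regression: pass the weighted sum through a logistic function
def predict_row(row, coefficients):
    # add the bias, the last coefficient
    result = coefficients[-1]
    # add the weighted input
    for i in range(len(row)):
        result += coefficients[i] * row[i]
    # logistic function
    logistic = 1.0 / (1.0 + exp(-result))
    return logistic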
That's about it in terms of changes from linear regression to logistic regression. As with linear regression, we can test the model with a set of random model coefficients.
...
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)
Program 27.16: Test the model with random coefficients
The predictions made by the model are probabilities for an example belonging to class=1. We
can round the prediction to be integer values 0 and 1 for the expected class labels.
...
yhat = [round(y) for y in yhat]
Program 27.17: Round predictions to labels
...
score = accuracy_score(y, yhat)
print('Accuracy: %f' % score)
Program 27.18: Calculate accuracy
That's it. We can tie all of this together and demonstrate our simple logistic regression model for binary classification. The complete example is listed below.
6. https://en.wikipedia.org/wiki/E_(mathematical_constant)
7. https://docs.python.org/3/library/math.html#math.exp
# logistic regression
def predict_row(row, coefficients):
# add the bias, the last coefficient
result = coefficients[-1]
# add the weighted input
for i in range(len(row)):
result += coefficients[i] * row[i]
# logistic function
logistic = 1.0 / (1.0 + exp(-result))
return logistic
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)
# round predictions to labels
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y, yhat)
print('Accuracy: %f' % score)
Program 27.19: Logistic regression function for binary classification
Running the example generates a prediction for each example in the training dataset then prints the classification accuracy for the predictions.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
We would expect about 50 percent accuracy given a set of random weights and a dataset
with an equal number of examples in each class, and that is approximately what we see in this
case.
Accuracy: 0.540000
Output 27.5: Result from Program 27.19
We can now optimize the coefficients of the model to achieve good accuracy on this dataset. The stochastic hill climbing algorithm used for linear regression can be used again for logistic regression. The important difference is an update to the objective() function to round the predictions and evaluate the model using classification accuracy instead of mean squared error. The hillclimbing() function also must be updated to maximize the score of solutions instead of minimizing it, as in the case of linear regression.
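Those updated listings are elided here; sketches of an accuracy-based objective() and a maximizing hillclimbing() (again assuming a Gaussian perturbation step) might look like the following.

from numpy.random import randn
from sklearn.metrics import accuracy_score

# objective function: classification accuracy, to be maximized
def objective(X, y, coefficients):
    # generate predictions for dataset
    yhat = predict_dataset(X, coefficients)
    # round predictions to labels
    yhat = [round(value) for value in yhat]
    # calculate accuracy
    score = accuracy_score(y, yhat)
    return score

# stochastic hill climbing that maximizes the objective
def hillclimbing(X, y, objective, solution, n_iter, step_size):
    # evaluate the initial solution
    solution_eval = objective(X, y, solution)
    for i in range(n_iter):
        # take a step from the current solution
        candidate = solution + randn(len(solution)) * step_size
        # evaluate the candidate solution
        candidate_eval = objective(X, y, candidate)
        # keep the candidate if it is better (higher accuracy)
        if candidate_eval >= solution_eval:
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d %.5f' % (i, solution_eval))
    return [solution, solution_eval]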
Finally, the coefficients found by the search can be evaluated using classification accuracy at the end of the run.
...
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# round predictions to labels
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y_test, yhat)
print('Test Accuracy: %f' % (score))
Program 27.22: Evaluate the model with the coefficients found by the search
Tying this all together, the complete example of using stochastic hill climbing to maximize classification accuracy of a logistic regression model is listed below.
# logistic regression
def predict_row(row, coefficients):
# add the bias, the last coefficient
result = coefficients[-1]
# add the weighted input
for i in range(len(row)):
result += coefficients[i] * row[i]
# logistic function
logistic = 1.0 / (1.0 + exp(-result))
return logistic
# objective function
def objective(X, y, coefficients):
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)
# round predictions to labels
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y, yhat)
return score
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# define the total iterations
n_iter = 2000
# define the maximum step size
step_size = 0.1
# determine the number of coefficients
n_coef = X.shape[1] + 1
# define the initial solution
solution = rand(n_coef)
# perform the hill climbing search
coefficients, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
print('Done!')
print('Coefficients: %s' % coefficients)
print('Train Accuracy: %f' % (score))
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# round predictions to labels
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y_test, yhat)
print('Test Accuracy: %f' % (score))
Program 27.23: Optimize logistic regression model with a stochastic hill climber
Running the example will report the iteration number and classiĄcation accuracy each time
there is an improvement made to the model. At the end of the search, the performance of the
best set of coefficients on the training dataset is reported and the performance of the same
model on the test dataset is calculated and reported.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that the optimization algorithm found a set of weights that achieved
about 87.3 percent accuracy on the training dataset and about 83.9 percent accuracy on the
test dataset.
...
>200 0.85672
>225 0.85672
>230 0.85672
>245 0.86418
>281 0.86418
>285 0.86716
>294 0.86716
>306 0.86716
>316 0.86716
>317 0.86716
>320 0.86866
>348 0.86866
>362 0.87313
>784 0.87313
>1649 0.87313
Done!
Coefficients: [-0.04652756 0.23243427 2.58587637 -0.45528253 -0.4954355 -0.42658053]
Train Accuracy: 0.873134
Test Accuracy: 0.839394
Output 27.6: Result from Program 27.23
27.5 Further Reading
APIs
sklearn.datasets.make_regression API.
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html
sklearn.datasets.make_classification API.
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
sklearn.metrics.mean_squared_error API.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
numpy.random.rand API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
Articles
Linear regression. Wikipedia.
https://en.wikipedia.org/wiki/Linear_regression
Logistic regression. Wikipedia.
https://en.wikipedia.org/wiki/Logistic_regression
27.6 Summary
In this tutorial, you discovered how to manually optimize the coefficients of regression models. Specifically, you learned:
▷ How to develop the inference models for regression from scratch.
▷ How to optimize the coefficients of a linear regression model for predicting numeric
values.
▷ How to optimize the coefficients of a logistic regression model using stochastic hill
climbing.
Next, we will see another example, using stochastic hill climbing to optimize a neural
network.
Optimize Neural Network
Models
28
Deep learning neural network models are fit on training data using the stochastic gradient descent optimization algorithm. Updates to the weights of the model are made using the backpropagation of error algorithm. The combination of the optimization and weight update algorithm was carefully chosen and is the most efficient approach known to fit neural networks. Nevertheless, it is possible to use alternate optimization algorithms to fit a neural network model
to a training dataset. This can be a useful exercise to learn more about how neural networks
function and the central nature of optimization in applied machine learning. It may also be
required for neural networks with unconventional model architectures and non-differentiable
transfer functions.
In this tutorial, you will discover how to manually optimize the weights of neural network
models. After completing this tutorial, you will know:
▷ How to develop the forward inference pass for neural network models from scratch.
▷ How to optimize the weights of a Perceptron model for binary classification.
▷ How to optimize the weights of a Multilayer Perceptron model using stochastic hill
climbing.
Let's get started.
a numeric output that can be interpreted for classification or regression predictive modeling. Models are trained by repeatedly exposing the model to examples of input and output and adjusting the weights to minimize the error of the model's output compared to the expected output. This is called the stochastic gradient descent optimization algorithm. The weights of the model are adjusted using a specific rule from calculus that assigns error proportionally to each weight in the network. This is called the backpropagation algorithm.
The stochastic gradient descent optimization algorithm with weight updates made using backpropagation is the best way to train neural network models. However, it is not the only way to train a neural network. It is possible to use any arbitrary optimization algorithm to train a neural network model. That is, we can define a neural network model architecture and use a given optimization algorithm to find a set of weights for the model that results in a minimum of prediction error or a maximum of classification accuracy. Using alternate optimization algorithms is expected to be less efficient on average than using stochastic gradient descent with backpropagation. Nevertheless, it may be more efficient in some specific cases, such as non-standard network architectures or non-differentiable transfer functions.
It can also be an interesting exercise to demonstrate the central nature of optimization in training machine learning algorithms, and specifically neural networks. Next, let's explore how to train a simple one-node neural network called a Perceptron model using stochastic hill climbing.
Running the example prints the shape of the created dataset, confirming our expectations.
(1000, 5) (1000,)
Output 28.1: Result from Program 28.1
1. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
Next, we need to define a Perceptron model. The Perceptron model has a single node that has
one input weight for each column in the dataset. Each input is multiplied by its corresponding
weight to give a weighted sum and a bias weight is then added, like an intercept coefficient
in a regression model. This weighted sum is called the activation. Finally, the activation is
interpreted and used to predict the class label, 1 for a positive activation and 0 for a negative
activation. Before we optimize the model weights, we must develop the model and our confidence in how it works.
Let's start by defining a function for interpreting the activation of the model. This is called
the activation function, or the transfer function; the latter name is more traditional and is my
preference. The transfer() function below takes the activation of the model and returns a
class label, class=1 for a positive or zero activation and class=0 for a negative activation. This
is called a step transfer function.
def transfer(activation):
if activation >= 0.0:
return 1
return 0
Program 28.2: Transfer function
Next, we can develop a function that calculates the activation of the model for a given input
row of data from the dataset. This function will take the row of data and the weights for the
model and calculate the weighted sum of the input with the addition of the bias weight. The
activate() function below implements this.
Note: We are using simple Python lists and imperative programming style instead of NumPy arrays or list comprehensions intentionally to make the code more readable for Python beginners.
Next, we can use the activate() and transfer() functions together to generate a prediction
for a given row of data. The predict_row() function below implements this.
Next, we can call the predict_row() function for each row in a given dataset. The
predict_dataset() function below implements this. Again, we are intentionally using simple
imperative coding style for readability instead of list comprehension.
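The predict_row() and predict_dataset() listings are elided at this point; sketches, assuming the transfer() and activate() functions defined above, might look like the following.

# make a prediction for one row of data
def predict_row(row, weights):
    # activate for the input row
    activation = activate(row, weights)
    # transfer the activation into a class label
    return transfer(activation)

# use the model weights to generate predictions for a dataset of rows
def predict_dataset(X, weights):
    yhats = list()
    for row in X:
        # make a prediction for this row
        yhat = predict_row(row, weights)
        # store the prediction
        yhats.append(yhat)
    return yhats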
Finally, we can use the model to make predictions on our synthetic dataset to confirm it is all working correctly. We can generate a random set of model weights using the rand() function². Recall that we need one weight for each input (five inputs in this dataset) plus an extra weight for the bias weight.
...
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# determine the number of weights
n_weights = X.shape[1] + 1
# generate random weights
weights = rand(n_weights)
Program 28.6: Define dataset and generate random weights
We can then use these weights with the dataset to make predictions.
...
yhat = predict_dataset(X, weights)
Program 28.7: Generate predictions for dataset
...
score = accuracy_score(y, yhat)
print(score)
Program 28.8: Calculate accuracy
That's it. We can tie all of this together and demonstrate our simple Perceptron model for classification. The complete example is listed below.
2. https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
# transfer function
def transfer(activation):
if activation >= 0.0:
return 1
return 0
# activation function
def activate(row, weights):
# add the bias, the last weight
activation = weights[-1]
# add the weighted input
for i in range(len(row)):
activation += weights[i] * row[i]
return activation
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# determine the number of weights
n_weights = X.shape[1] + 1
# generate random weights
weights = rand(n_weights)
# generate predictions for dataset
yhat = predict_dataset(X, weights)
# calculate accuracy
score = accuracy_score(y, yhat)
print(score)
Program 28.9: Simple perceptron model for binary classification
Running the example generates a prediction for each example in the training dataset then prints the classification accuracy for the predictions.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
We would expect about 50 percent accuracy given a set of random weights and a dataset
with an equal number of examples in each class, and that is approximately what we see in this
case.
0.548
Output 28.2: Result from Program 28.9
We can now optimize the weights of the model to achieve good accuracy on this dataset. First,
we need to split the dataset into train and test sets. It is important to hold back some data not
used in optimizing the model so that we can prepare a reasonable estimate of the performance
of the model when used to make predictions on new data. We will use 67 percent of the data for
training and the remaining 33 percent as a test set for evaluating the performance of the model.
...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
Program 28.10: Split data into train test sets
Next, we can develop a stochastic hill climbing algorithm. The optimization algorithm requires
an objective function to optimize. It must take a set of weights and return a score that is to be
minimized or maximized corresponding to a better model. In this case, we will evaluate the
accuracy of the model with a given set of weights and return the classification accuracy, which must be maximized. The objective() function below implements this, given the dataset and a set of weights, and returns the accuracy of the model.
Next, we can define the stochastic hill climbing algorithm. The algorithm will require an initial solution (e.g. random weights) and will iteratively keep making small changes to the solution and checking if it results in a better performing model. The amount of change made to the current solution is controlled by a step_size hyperparameter. This process will continue for a fixed
number of iterations, also provided as a hyperparameter. The hillclimbing() function below
implements this, taking the dataset, objective function, initial solution, and hyperparameters as
arguments and returns the best set of weights found and the estimated performance.
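The objective() and hillclimbing() listings are elided here; sketches mirroring the earlier regression example, assuming a Gaussian perturbation of all weights and a maximized accuracy score, might look like the following.

from numpy.random import randn
from sklearn.metrics import accuracy_score

# objective function: classification accuracy of the model with the given weights
def objective(X, y, weights):
    # generate predictions for dataset
    yhat = predict_dataset(X, weights)
    # calculate accuracy, to be maximized
    score = accuracy_score(y, yhat)
    return score

# stochastic hill climbing that maximizes the objective
def hillclimbing(X, y, objective, solution, n_iter, step_size):
    # evaluate the initial solution
    solution_eval = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step from the current solution
        candidate = solution + randn(len(solution)) * step_size
        # evaluate the candidate solution
        candidate_eval = objective(X, y, candidate)
        # keep the candidate if it is better (higher accuracy)
        if candidate_eval >= solution_eval:
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d %.5f' % (i, solution_eval))
    return [solution, solution_eval]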
We can then call this function, passing in a set of weights as the initial solution and the training
dataset as the dataset to optimize the model against.
...
# define the total iterations
n_iter = 1000
# define the maximum step size
step_size = 0.05
# determine the number of weights
n_weights = X.shape[1] + 1
# define the initial solution
solution = rand(n_weights)
# perform the hill climbing search
weights, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (weights, score))
Program 28.13: Perform hill climbing search
Finally, we can evaluate the best model on the test dataset and report the performance.
...
# generate predictions for the test dataset
yhat = predict_dataset(X_test, weights)
# calculate accuracy
score = accuracy_score(y_test, yhat)
print('Test Accuracy: %.5f' % (score * 100))
Program 28.14: Evaluate the best model on the test dataset
Tying this together, the complete example of optimizing the weights of a Perceptron model on
the synthetic binary optimization dataset is listed below.
# transfer function
def transfer(activation):
    if activation >= 0.0:
        return 1
    return 0
# activation function
def activate(row, weights):
# add the bias, the last weight
activation = weights[-1]
# add the weighted input
for i in range(len(row)):
activation += weights[i] * row[i]
return activation
# objective function
def objective(X, y, weights):
# generate predictions for dataset
yhat = predict_dataset(X, weights)
# calculate accuracy
score = accuracy_score(y, yhat)
return score
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# define the total iterations
n_iter = 1000
# define the maximum step size
step_size = 0.05
# determine the number of weights
n_weights = X.shape[1] + 1
# define the initial solution
solution = rand(n_weights)
# perform the hill climbing search
weights, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (weights, score))
# generate predictions for the test dataset
yhat = predict_dataset(X_test, weights)
# calculate accuracy
score = accuracy_score(y_test, yhat)
print('Test Accuracy: %.5f' % (score * 100))
Program 28.15: Hill climbing to optimize weights of a perceptron model for
classification
Running the example will report the iteration number and classification accuracy each time there is an improvement made to the model. At the end of the search, the performance of the best set of weights on the training dataset is reported and the performance of the same model on the test dataset is calculated and reported.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that the optimization algorithm found a set of weights that achieved
about 88.5 percent accuracy on the training dataset and about 81.8 percent accuracy on the
test dataset.
...
>111 0.88060
>119 0.88060
>126 0.88209
>134 0.88209
>205 0.88209
>262 0.88209
>280 0.88209
>293 0.88209
>297 0.88209
>336 0.88209
>373 0.88209
>437 0.88358
>463 0.88507
>630 0.88507
>701 0.88507
Done!
f([ 0.0097317 0.13818088 1.17634326 -0.04296336 0.00485813 -0.14767616]) = 0.885075
Test Accuracy: 81.81818
Output 28.3: Result from Program 28.15
28.4 Optimize a Multilayer Perceptron
Now that we are familiar with how to manually optimize the weights of a Perceptron model, let's look at how we can extend the example to optimize the weights of a Multilayer Perceptron (MLP) model.
def transfer(activation):
# sigmoid transfer function
return 1.0 / (1.0 + exp(-activation))
Program 28.16: Transfer function
We can use the same activate() function from the previous section. Here, we will use it to
calculate the activation for each node in a given layer. The predict_row() function must be
replaced with a more elaborate version. The function takes a row of data and the network and
returns the output of the network. We will define our network as a list of lists. Each layer will be
a list of nodes and each node will be a list or array of weights. To calculate the prediction of the
network, we simply enumerate the layers, then enumerate nodes, then calculate the activation
and transfer output for each node. In this case, we will use the same transfer function for all
nodes in the network, although this does not have to be the case. For networks with more than
one layer, the output from the previous layer is used as input to each node in the next layer.
The output from the final layer in the network is then returned. The predict_row() function
below implements this.
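The predict_row() listing is elided at this point; a sketch of the forward pass it describes, assuming the activate() and transfer() functions defined above and a single node in the output layer, might look like this.

# forward pass: propagate one row of data through the layers of the network
def predict_row(row, network):
    inputs = row
    # enumerate the layers in the network from input to output
    for layer in network:
        new_inputs = list()
        # enumerate nodes in the layer
        for node in layer:
            # activate the node for the current inputs
            activation = activate(inputs, node)
            # transfer the activation
            output = transfer(activation)
            new_inputs.append(output)
        # output from this layer is the input to the next layer
        inputs = new_inputs
    # return the output of the single node in the final layer
    return inputs[0]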
That's about it. Finally, we need to define a network to use. For example, we can define an MLP with a single hidden layer with a single node as follows:
...
node = rand(n_inputs + 1)
layer = [node]
network = [layer]
Program 28.18: Create a one node network
This is practically a Perceptron, although with a sigmoid transfer function. Quite boring. Let's define an MLP with one hidden layer and one output layer. The first hidden layer will have 10 nodes, and each node will take the input pattern from the dataset (e.g. five inputs). The output layer will have a single node that takes inputs from the outputs of the first hidden layer and
then outputs a prediction. We assume a bias term exists, hence the +1 below.
...
n_hidden = 10
hidden1 = [rand(n_inputs + 1) for _ in range(n_hidden)]
output1 = [rand(n_hidden + 1)]
network = [hidden1, output1]
Program 28.19: One hidden layer and an output layer
...
yhat = predict_dataset(X, network)
Program 28.20: Generate predictions for dataset
Before we calculate the classification accuracy, we must round the predictions to class labels 0 and 1.
...
# round the predictions
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y, yhat)
print(score)
Program 28.21: Calculate classification accuracy
Tying this all together, the complete example of evaluating an MLP with random initial weights
on our synthetic binary classification dataset is listed below.
# transfer function
def transfer(activation):
# sigmoid transfer function
return 1.0 / (1.0 + exp(-activation))
# activation function
def activate(row, weights):
# add the bias, the last weight
activation = weights[-1]
# add the weighted input
for i in range(len(row)):
activation += weights[i] * row[i]
return activation
yhats.append(yhat)
return yhats
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# determine the number of inputs
n_inputs = X.shape[1]
# one hidden layer and an output layer, each perceptron has a bias term
n_hidden = 10
hidden1 = [rand(n_inputs + 1) for _ in range(n_hidden)]
output1 = [rand(n_hidden + 1)]
network = [hidden1, output1]
# generate predictions for dataset
yhat = predict_dataset(X, network)
# round the predictions
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y, yhat)
print(score)
Program 28.22: Develop an MLP model for classification
Running the example generates a prediction for each example in the training dataset, then prints the classification accuracy for the predictions.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Again, we would expect about 50 percent accuracy given a set of random weights and a
dataset with an equal number of examples in each class, and that is approximately what we see
in this case.
0.499
Output 28.4: Result from Program 28.22
Next, we can apply the stochastic hill climbing algorithm to the dataset. It is very much the
same as applying hill climbing to the Perceptron model, except in this case, a step requires a
modification to all weights in the network. For this, we will develop a new function that creates
a copy of the network and mutates each weight in the network while making the copy. The
step() function below implements this.
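The step() listing is elided here; a minimal sketch, assuming each weight is perturbed with Gaussian noise scaled by step_size, might look like this.

from numpy.random import randn

# take a step in the search space: copy the network and mutate every weight
def step(network, step_size):
    new_net = list()
    # enumerate layers in the network
    for layer in network:
        new_layer = list()
        # enumerate nodes in this layer
        for node in layer:
            # copy the node weights and perturb each one
            new_node = node.copy()
            for i in range(len(new_node)):
                new_node[i] = new_node[i] + randn() * step_size
            new_layer.append(new_node)
        new_net.append(new_layer)
    return new_net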
Modifying all weights in the network is aggressive. A less aggressive step in the search space
might be to make a small change to a subset of the weights in the model, perhaps controlled by
a hyperparameter. This is left as an extension. We can then call this new step() function from
the hillclimbing() function.
Tying this together, the complete example of applying stochastic hill climbing to optimize the
weights of an MLP model for binary classification is listed below.
# transfer function
def transfer(activation):
# sigmoid transfer function
return 1.0 / (1.0 + exp(-activation))
# activation function
def activate(row, weights):
# add the bias, the last weight
activation = weights[-1]
    # add the weighted input
    for i in range(len(row)):
        activation += weights[i] * row[i]
    return activation
# objective function
def objective(X, y, network):
# generate predictions for dataset
yhat = predict_dataset(X, network)
# round the predictions
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y, yhat)
return score
new_net.append(new_layer)
return new_net
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# define the total iterations
n_iter = 1000
# define the maximum step size
step_size = 0.1
# determine the number of inputs
n_inputs = X.shape[1]
# one hidden layer and an output layer
n_hidden = 10
hidden1 = [rand(n_inputs + 1) for _ in range(n_hidden)]
output1 = [rand(n_hidden + 1)]
network = [hidden1, output1]
# perform the hill climbing search
network, score = hillclimbing(X_train, y_train, objective, network, n_iter, step_size)
print('Done!')
print('Best: %f' % (score))
# generate predictions for the test dataset
yhat = predict_dataset(X_test, network)
# round the predictions
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y_test, yhat)
print('Test Accuracy: %.5f' % (score * 100))
Program 28.25: Stochastic hill climbing to optimize a multilayer perceptron for classification
Running the example will report the iteration number and classification accuracy each time
there is an improvement made to the model. At the end of the search, the performance of the
best set of weights on the training dataset is reported and the performance of the same model
on the test dataset is calculated and reported.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the optimization algorithm found a set of weights that achieved
about 87.3 percent accuracy on the training dataset and about 85.1 percent accuracy on the
test dataset.
...
>55 0.755224
>56 0.765672
>59 0.794030
>66 0.805970
>77 0.835821
>120 0.838806
>165 0.840299
>188 0.841791
>218 0.846269
>232 0.852239
>237 0.852239
>239 0.855224
>292 0.867164
>368 0.868657
>823 0.868657
>852 0.871642
>889 0.871642
>892 0.871642
>992 0.873134
Done!
Best: 0.873134
Test Accuracy: 85.15152
Output 28.5: Result from Program 28.25
28.5 Further Reading
APIs
sklearn.datasets.make_classification API.
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
sklearn.metrics.accuracy_score API.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
numpy.random.rand API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
28.6 Summary
In this tutorial, you discovered how to manually optimize the weights of neural network models.
Specifically, you learned:
▷ How to develop the forward inference pass for neural network models from scratch.
▷ How to optimize the weights of a Perceptron model for binary classification.
▷ How to optimize the weights of a Multilayer Perceptron model using stochastic hill climbing.
Next, we will try another project that uses an optimization algorithm for feature selection.
Feature Selection using
Stochastic Optimization
29
Typically, a simpler and better-performing machine learning model can be developed by removing
input features (columns) from the training dataset. This is called feature selection and there
are many different types of algorithms that can be used. It is possible to frame the problem of
feature selection as an optimization problem. In the case that there are few input features, all
possible combinations of input features can be evaluated and the best subset found definitively.
In the case of a vast number of input features, a stochastic optimization algorithm can be used
to explore the search space and find an effective subset of features.
In this tutorial, you will discover how to use optimization algorithms for feature selection
in machine learning. After completing this tutorial, you will know:
▷ The problem of feature selection can be broadly defined as an optimization problem.
▷ How to enumerate all possible subsets of input features for a dataset.
▷ How to apply stochastic optimization to select an optimal subset of input features.
Let's get started.
Wrapper feature selection methods create many models with different subsets of input features
and select those features that result in the best-performing model according to a performance
metric. These methods are unconcerned with the variable types, although they can be
computationally expensive. RFE is a good example of a wrapper feature selection method.
Filter feature selection methods use statistical techniques to evaluate the relationship between
each input variable and the target variable, and these scores are used as the basis to choose
(filter) those input variables that will be used in the model.
▷ Wrapper Feature Selection: Search for well-performing subsets of features.
▷ Filter Feature Selection: Select subsets of features based on their relationship with the
target.
A popular wrapper method is the Recursive Feature Elimination, or RFE, algorithm. RFE works
by searching for a subset of features by starting with all features in the training dataset and
successively removing features until the desired number remains. This is achieved by fitting the
given machine learning algorithm used in the core of the model, ranking features by importance,
discarding the least important features, and re-fitting the model. This process is repeated until
a specified number of features remains.
The problem of wrapper feature selection can be framed as an optimization problem. That
is, find a subset of input features that results in the best model performance. RFE is one
approach to solving this problem systematically, although it may be limited by a large number
of features. An alternative approach would be to use a stochastic optimization algorithm, such
as a stochastic hill climbing algorithm, when the number of features is very large. When the
number of features is relatively small, it may be possible to enumerate all possible subsets of
features.
▷ Few Input Variables: Enumerate all possible subsets of features.
▷ Many Input Features: Use a stochastic optimization algorithm to find good subsets of features.
Now that we are familiar with the idea that feature selection may be explored as an optimization
problem, let's look at how we might enumerate all possible feature subsets.
make_classification API: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
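The dataset-definition listing itself is not reproduced in this excerpt. A minimal sketch that produces the shape shown in Output 29.1 is below; the exact make_classification() arguments are illustrative assumptions.
# define a small synthetic classification dataset for feature selection
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)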
Running the example creates the dataset and confirms that it has the desired shape.
(1000, 5) (1000,)
Output 29.1: Result from Program 29.1
Next, we can establish a baseline in performance using a model evaluated on the entire dataset.
We will use a DecisionTreeClassifier as the model because its performance is quite sensitive
to the choice of input variables. We will evaluate the model using good practices, such as
repeated stratified k-fold cross-validation with three repeats and 10 folds. The complete example
is listed below.
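That listing is not reproduced in this excerpt; a minimal sketch of such a baseline evaluation, assuming the X and y arrays defined above, is:
# baseline: evaluate a decision tree on all input features
from numpy import mean, std
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))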
Running the example evaluates the decision tree on the entire dataset and reports the mean
and standard deviation classification accuracy.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the model achieved an accuracy of about 80.5 percent.
DecisionTreeClassifier API: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Next, we can try to improve model performance by using a subset of the input features.
First, we must choose a representation to enumerate. In this case, we will enumerate a
list of boolean values, with one value for each input feature: True if the feature is to be
used and False if the feature is not to be used as input. For example, with the five input
features the sequence [True, True, True, True, True] would use all input features, and
[True, False, False, False, False] would only use the first input feature as input. We
can enumerate all sequences of boolean values of length 5 using the product() Python
function. We must specify the valid values [True, False] and the number of repeats in the
sequence, which is equal to the number of input variables. The function returns an iterable that
we can enumerate directly for each sequence.
...
# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):
...
Program 29.3: Enumerate combinations of input features
For a given sequence of boolean values, we can enumerate it and transform it into a sequence of
column indexes for each True in the sequence.
...
ix = [i for i, x in enumerate(subset) if x]
Program 29.4: Convert the sequence into column indexes
If the sequence has no column indexes (in the case of all False values), then we can skip that
sequence.
if len(ix) == 0:
continue
Program 29.5: Skip the sequence if no column
We can then use the column indexes to choose the columns in the dataset.
...
X_new = X[:, ix]
Program 29.6: Select columns
And this subset of the dataset can then be evaluated as we did before.
...
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize scores
result = mean(scores)
Program 29.7: Evaluate the subset
If the accuracy for the model is better than the best sequence found so far, we can store it.
...
if best_score is None or result >= best_score:
# better result
best_subset, best_score = ix, result
Program 29.8: Check if it is better than the best so far
And that's it. Tying this together, the complete example of feature selection by enumerating all
possible feature subsets is listed below.
# ... (imports, dataset definition, and the enumeration loop from Programs 29.3 to 29.8) ...
# report best
print('Done!')
print('f(%s) = %f' % (best_subset, best_score))
Program 29.9: Feature selection by enumerating all possible subsets of features
Running the example reports the mean classification accuracy of the model for each subset of
features considered. The best subset is then reported at the end of the run.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the best subset of features involved features at indexes [2, 3,
4] that resulted in a mean classification accuracy of about 83.0 percent, which is better than the
result reported previously using all input features.
Now that we know how to enumerate all possible feature subsets, let's look at how we might
use a stochastic optimization algorithm to choose a subset of features.
Running the example creates the dataset and confirms that it has the desired shape.
We can establish a baseline in performance by evaluating a model on the dataset with all input
features. Because the dataset is large and the model is slow to evaluate, we will modify the
evaluation of the model to use 3-fold cross-validation, i.e. fewer folds and no repeats. The
complete example is listed below.
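That listing is also not reproduced here; a minimal sketch of the modified baseline, using 3-fold cross-validation on the larger dataset, might look like this:
# baseline: decision tree evaluated on all 500 input features with 3-fold CV
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define model
model = DecisionTreeClassifier()
# evaluate model with fewer folds and no repeats
scores = cross_val_score(model, X, y, scoring='accuracy', cv=3, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))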
Running the example evaluates the decision tree on the entire dataset and reports the mean
and standard deviation classification accuracy.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the model achieved an accuracy of about 91.3 percent. This
provides a baseline that we would expect to outperform using feature selection.
We will use a simple stochastic hill climbing algorithm as the optimization algorithm. First, we
must define the objective function. It will take the dataset and a subset of features to use as
input and return an estimated model accuracy from 0 (worst) to 1 (best). It is a maximizing
optimization problem. This objective function is simply the decoding of the sequence and model
evaluation step from the previous section. The objective() function below implements this
and returns both the score and the decoded subset of columns used for helpful reporting.
We also need a function that can take a step in the search space. Given an existing solution,
it must modify it and return a new solution in close proximity. In this case, we will achieve
this by randomly flipping the inclusion/exclusion of columns in the sequence. Each position
in the sequence will be considered independently and will be flipped probabilistically, where
the probability of flipping is a hyperparameter. The mutate() function below implements this
given a candidate solution (sequence of booleans) and a mutation hyperparameter, creating and
returning a modified solution (a step in the search space). The larger the p_mutate value (in
the range 0 to 1), the larger the step in the search space.
We can now implement the hill climbing algorithm. The initial solution is a randomly generated
sequence, which is then evaluated.
...
# generate an initial point
solution = choice([True, False], size=X.shape[1])
# evaluate the initial point
solution_eval, ix = objective(X, y, solution)
Program 29.14: Evaluate a random sequence
We then loop for a fixed number of iterations, creating mutated versions of the current solution,
evaluating them, and saving them if the score is better.
...
for i in range(n_iter):
# take a step
candidate = mutate(solution, p_mutate)
# evaluate candidate point
candidate_eval, ix = objective(X, y, candidate)
# check if we should keep the new point
if candidate_eval >= solution_eval:
# store the new point
solution, solution_eval = candidate, candidate_eval
# report progress
print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))
Program 29.15: Hill climbing
The hillclimbing() function below implements this, taking the dataset, objective function,
and hyperparameters as arguments and returning the best subset of dataset columns and the
estimated performance of the model.
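That listing is not included in this excerpt; a minimal sketch of such a hillclimbing() function, assuming the objective() and mutate() functions shown in Program 29.19 below and choice() imported from numpy.random, is:
# hill climbing local search for feature selection
def hillclimbing(X, y, objective, n_iter, p_mutate):
    # generate an initial random subset of features
    solution = choice([True, False], size=X.shape[1])
    # evaluate the initial point
    solution_eval, ix = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = mutate(solution, p_mutate)
        # evaluate candidate point
        candidate_eval, ix = objective(X, y, candidate)
        # check if we should keep the new point
        if candidate_eval >= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
        # report progress
        print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))
    return solution, solution_eval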
We can then call this function and pass in our synthetic dataset to perform optimization for
feature selection. In this case, we will run the algorithm for 100 iterations and make about five
flips to the sequence for a given mutation, which is quite conservative.
...
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)
Program 29.17: Perform the hill climbing search
At the end of the run, we will convert the boolean sequence into column indexes (so we could fit
a final model if we wanted) and report the performance of the best subsequence.
...
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))
Program 29.18: Convert into column indexes
# objective function
def objective(X, y, subset):
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
# check for no columns (all False)
if len(ix) == 0:
    return 0.0
# select columns
X_new = X[:, ix]
# define model
model = DecisionTreeClassifier()
# evaluate model
scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)
# summarize scores
result = mean(scores)
return result, ix
# mutation operator
def mutate(solution, p_mutate):
# make a copy
child = solution.copy()
for i in range(len(child)):
# check for a mutation
if rand() < p_mutate:
# flip the inclusion
child[i] = not child[i]
return child
# ... hillclimbing() as described above ...
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))
Program 29.19: Stochastic optimization for feature selection
Running the example reports the mean classification accuracy of the model for each subset of
features considered. The best subset is then reported at the end of the run.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the best performance was achieved with a subset of 239
features and a classification accuracy of approximately 91.8 percent. This is better than a model
evaluated on all input features. Although the result is better, we know we can do a lot better,
perhaps with tuning of the hyperparameters of the optimization algorithm or perhaps by using
an alternate optimization algorithm.
...
>80 f(240) = 0.918099
>81 f(236) = 0.918099
>82 f(238) = 0.918099
>83 f(236) = 0.918099
>84 f(239) = 0.918099
>85 f(240) = 0.918099
>86 f(239) = 0.918099
>87 f(245) = 0.918099
>88 f(241) = 0.918099
>89 f(239) = 0.918099
>90 f(239) = 0.918099
>91 f(241) = 0.918099
>92 f(243) = 0.918099
>93 f(245) = 0.918099
>94 f(239) = 0.918099
>95 f(245) = 0.918099
>96 f(244) = 0.918099
>97 f(242) = 0.918099
>98 f(238) = 0.918099
>99 f(248) = 0.918099
>100 f(238) = 0.918099
Done!
Best: f(239) = 0.918099
Output 29.6: Result from Program 29.19
29.5 Further Reading
APIs
sklearn.datasets.make_classification API.
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
itertools.product API.
https://docs.python.org/3/library/itertools.html#itertools.product
29.6 Summary
In this tutorial, you discovered how to use optimization algorithms for feature selection in
machine learning. Specifically, you learned:
▷ The problem of feature selection can be broadly defined as an optimization problem.
▷ How to enumerate all possible subsets of input features for a dataset.
▷ How to apply stochastic optimization to select an optimal subset of input features.
Next, we will see how optimization algorithms can be used to tune hyperparameters.
30
Manually Optimize Machine
Learning Model
Hyperparameters
Machine learning algorithms have hyperparameters that allow the algorithms to be tailored to
specific datasets. Although the impact of hyperparameters may be understood generally, their
specific effect on a dataset and their interactions during learning may not be known. Therefore,
it is important to tune the values of algorithm hyperparameters as part of a machine learning
project. It is common to use naive optimization algorithms to tune hyperparameters, such as a
grid search and a random search. An alternate approach is to use a stochastic optimization
algorithm, like a stochastic hill climbing algorithm.
In this tutorial, you will discover how to manually optimize the hyperparameters of machine
learning algorithms. After completing this tutorial, you will know:
▷ Stochastic optimization algorithms can be used instead of grid and random search for
hyperparameter optimization.
▷ How to use a stochastic hill climbing algorithm to tune the hyperparameters of the
Perceptron algorithm.
▷ How to manually optimize the hyperparameters of the XGBoost gradient boosting
algorithm.
Let's get started.
Knowing how to best set a hyperparameter and combinations of interacting hyperparameters
for a given dataset is challenging. A better approach is to objectively search different values for model
hyperparameters and choose a subset that results in a model that achieves the best performance
on a given dataset. This is called hyperparameter optimization, or hyperparameter tuning. A
range of different optimization algorithms may be used, although two of the simplest and most
common methods are random search and grid search.
▷ Random Search. Define a search space as a bounded domain of hyperparameter values
and randomly sample points in that domain.
▷ Grid Search. Define a search space as a grid of hyperparameter values and evaluate
every position in the grid.
Grid search is great for spot-checking combinations that are known to perform well generally.
Random search is great for discovery and getting hyperparameter combinations that you would
not have guessed intuitively, although it often requires more time to execute. Grid and random
search are primitive optimization algorithms, and it is possible to use any optimization we like
to tune the performance of a machine learning algorithm. For example, it is possible to use
stochastic optimization algorithms. This might be desirable when good or great performance is
required and there are sufficient resources available to tune the model.
Next, let's look at how we might use a stochastic hill climbing algorithm to tune the
performance of the Perceptron algorithm.
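The dataset-definition listing is not reproduced here; it uses the same make_classification() call that appears in the complete example later in this section:
# define a synthetic binary classification dataset
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)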
Running the example prints the shape of the created dataset, confirming our expectations.
(1000, 5) (1000,)
Output 30.1: Result from Program 30.1
make_classification API: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
Scikit-learn provides an implementation of the Perceptron model via the Perceptron class.
Before we tune the hyperparameters of the model, we can establish a baseline in performance
using the default hyperparameters. We will evaluate the model using good practices of repeated
stratified k-fold cross-validation via the RepeatedStratifiedKFold class.
The complete example of evaluating the Perceptron model with default hyperparameters
on our synthetic binary classification dataset is listed below.
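That listing is not reproduced in this excerpt; a minimal sketch of the baseline evaluation, assuming default Perceptron hyperparameters, is:
# baseline: Perceptron with default hyperparameters
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# define model
model = Perceptron()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))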
Running the example evaluates the model and reports the mean and standard deviation
of the classification accuracy.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the model with default hyperparameters achieved a classification
accuracy of about 78.5 percent. We would hope that we can achieve better performance than
this with optimized hyperparameters.
Next, we can optimize the hyperparameters of the Perceptron model using a stochastic hill
climbing algorithm. There are many hyperparameters that we could optimize, although we will
focus on two that perhaps have the most impact on the learning behavior of the model; they are:
▷ Learning Rate (eta0)
▷ Regularization Weight (alpha)
Perceptron API: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html
RepeatedStratifiedKFold API: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html
Next, we need a function to take a step in the search space. The search space is defined by
two variables (eta and alpha). A step in the search space must have some relationship to the
previous values and must be bound to sensible values (e.g. between 0 and 1). We will use a
"step_size" hyperparameter that controls how far the algorithm is allowed to move from the
existing configuration. A new configuration will be chosen probabilistically using a Gaussian
distribution with the current value as the mean of the distribution and the step size as the
standard deviation of the distribution. We can use the randn() NumPy function to generate
random numbers with a Gaussian distribution. The step() function below implements this
and will take a step in the search space and generate a new configuration using an existing
configuration.
numpy.random.randn API: https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html
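The original step() listing is not included in this excerpt. A minimal sketch under the assumptions above (Gaussian perturbation of each value, clipped to sensible bounds, with randn() imported from numpy.random) might look like this; the exact clipping values are illustrative.
# take a step in the hyperparameter search space
def step(cfg, step_size):
    # unpack the configuration
    eta, alpha = cfg
    # perturb the learning rate and keep it positive
    new_eta = eta + randn() * step_size
    if new_eta <= 0.0:
        new_eta = 1e-8
    # perturb the regularization weight and keep it non-negative
    new_alpha = alpha + randn() * step_size
    if new_alpha < 0.0:
        new_alpha = 0.0
    # return the new configuration
    return [new_eta, new_alpha]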
Next, we need to implement the stochastic hill climbing algorithm that will call our objective()
function to evaluate candidate solutions and our step() function to take a step in the search
space. The search first generates a random initial solution, in this case with eta and alpha
values in the range 0 and 1. The initial solution is then evaluated and is taken as the current
best working solution.
...
# starting point for the search
solution = [rand(), rand()]
# evaluate the initial point
solution_eval = objective(X, y, solution)
Program 30.5: Evaluate a random point
Next, the algorithm iterates for a fixed number of iterations provided as a hyperparameter to
the search. Each iteration involves taking a step and evaluating the new candidate solution.
...
# take a step
candidate = step(solution, step_size)
# evaluate candidate point
candidate_eval = objective(X, y, candidate)
Program 30.6: Evaluate the candidate solution
If the new solution is better than the current working solution, it is taken as the new current
working solution.
...
if candidate_eval >= solution_eval:
# store the new point
solution, solution_eval = candidate, candidate_eval
# report progress
print('>%d, cfg=%s %.5f' % (i, solution, solution_eval))
Program 30.7: Check if we should keep the new point
At the end of the search, the best solution and its performance are then returned. Tying this
together, the hillclimbing() function below implements the stochastic hill climbing algorithm
for tuning the Perceptron algorithm, taking the dataset, objective function, number of iterations,
and step size as arguments.
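That listing is not reproduced in this excerpt; assembling the fragments in Programs 30.5 to 30.7 gives a sketch like the following, assuming rand() is imported from numpy.random:
# hill climbing local search for hyperparameter tuning
def hillclimbing(X, y, objective, n_iter, step_size):
    # starting point for the search: random eta and alpha in [0, 1]
    solution = [rand(), rand()]
    # evaluate the initial point
    solution_eval = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = step(solution, step_size)
        # evaluate candidate point
        candidate_eval = objective(X, y, candidate)
        # check if we should keep the new point
        if candidate_eval >= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d, cfg=%s %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]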
We can then call the algorithm and report the results of the search. In this case, we will run the
algorithm for 100 iterations and use a step size of 0.1, chosen after a little trial and error.
...
# define the total iterations
n_iter = 100
# step size in the search space
step_size = 0.1
# perform the hill climbing search
cfg, score = hillclimbing(X, y, objective, n_iter, step_size)
print('Done!')
print('cfg=%s: Mean Accuracy: %f' % (cfg, score))
Program 30.9: Perform the hill climbing search
Tying this together, the complete example of manually tuning the Perceptron algorithm is listed
below.
# objective function
def objective(X, y, cfg):
# unpack config
eta, alpha = cfg
# define model
model = Perceptron(penalty='elasticnet', alpha=alpha, eta0=eta)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# calculate mean accuracy
result = mean(scores)
return result
# ... step() and hillclimbing() as described above ...
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# define the total iterations
n_iter = 100
# step size in the search space
step_size = 0.1
# perform the hill climbing search
cfg, score = hillclimbing(X, y, objective, n_iter, step_size)
print('Done!')
print('cfg=%s: Mean Accuracy: %f' % (cfg, score))
Program 30.10: Manually search perceptron hyperparameters for binary classification
Running the example reports the configuration and result each time an improvement is seen
during the search. At the end of the run, the best configuration and result are reported.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the best result involved using a learning rate slightly above 1
at 1.004 and a regularization weight of about 0.002, achieving a mean accuracy of about 79.1
percent, better than the default configuration that achieved an accuracy of about 78.5 percent.
Now that we are familiar with how to use a stochastic hill climbing algorithm to tune the
hyperparameters of a simple machine learning algorithm, let's look at tuning a more advanced
algorithm, such as XGBoost.
Once installed, you can confirm that it was installed successfully and that you are using a
modern version by running the following code:
import xgboost
print("xgboost", xgboost.__version__)
Program 30.11: Show the version of XGBoost
Running the code, you should see the following version number or higher.
xgboost 1.0.1
Although the XGBoost library has its own Python API, we can use XGBoost models with
the scikit-learn API via the XGBClassifier wrapper class. An instance of the model can be
instantiated and used just like any other scikit-learn class for model evaluation. For example:
...
model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
Program 30.12: Define model
XGBoost Python API (XGBClassifier): https://xgboost.readthedocs.io/en/latest/python/python_api.html
# define model
model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Program 30.13: XGBoost with default hyperparameters for binary classification
Running the example evaluates the model and reports the mean and standard deviation of the
classification accuracy.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the model with default hyperparameters achieved a classification
accuracy of about 84.9 percent. We would hope that we can achieve better performance than
this with optimized hyperparameters.
Next, we can adapt the stochastic hill climbing optimization algorithm to tune the
hyperparameters of the XGBoost model. There are many hyperparameters that we may
want to optimize for the XGBoost model. We will focus on four key hyperparameters; they are:
▷ Learning Rate (learning_rate)
▷ Number of Trees (n_estimators)
▷ Subsample Percentage (subsample)
▷ Tree Depth (max_depth)
The learning rate controls the contribution of each tree to the ensemble. Sensible values are
less than 1.0 and slightly above 0.0 (e.g. 1e-8). The number of trees controls the size of the
ensemble, and often, more trees are better up to a point of diminishing returns. Sensible values
are between 1 tree and hundreds or thousands of trees. The subsample percentage defines the
random sample size used to train each tree, defined as a percentage of the size of the original
dataset. Values are between a value slightly above 0.0 (e.g. 1e-8) and 1.0. The tree depth is the
number of levels in each tree. Deeper trees are more specific to the training dataset and perhaps
overfit. Shorter trees often generalize better. Sensible values are between 1 and 10 or 20.
First, we must update the objective() function to unpack the hyperparameters of the
XGBoost model, configure it, and then evaluate the mean classification accuracy.
Next, we need to define the step() function used to take a step in the search space. Each
hyperparameter has quite a different range, therefore we will define the step size (standard
deviation of the distribution) separately for each hyperparameter. We will also define the step
sizes inline rather than as arguments to the function, to keep things simple. The number of
trees and the depth are integers, so the stepped values are rounded. The step sizes chosen are
arbitrary, chosen after a little trial and error. The updated step function is listed below.
def step(cfg):
# unpack config
lrate, n_tree, subsam, depth = cfg
# learning rate
lrate = lrate + randn() * 0.01
if lrate <= 0.0:
lrate = 1e-8
if lrate > 1:
lrate = 1.0
# number of trees
n_tree = round(n_tree + randn() * 50)
if n_tree <= 0.0:
n_tree = 1
# subsample percentage
subsam = subsam + randn() * 0.1
if subsam <= 0.0:
subsam = 1e-8
if subsam > 1:
subsam = 1.0
# max tree depth
depth = round(depth + randn() * 7)
if depth <= 1:
depth = 1
# return new config
return [lrate, n_tree, subsam, depth]
Program 30.15: Take a step in the search space
Finally, the hillclimbing() algorithm must be updated to define an initial solution with
appropriate values. In this case, we will define the initial solution with sensible defaults,
matching the default hyperparameters, or close to them.
...
solution = step([0.1, 100, 1.0, 7])
Program 30.16: Set starting point for the search
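The updated hillclimbing() listing is not reproduced in this excerpt; a sketch consistent with Program 30.16 and the call in Program 30.17, where the step sizes live inside step() so no step_size argument is needed, might be:
# hill climbing local search for XGBoost hyperparameter tuning
def hillclimbing(X, y, objective, n_iter):
    # start from sensible defaults, perturbed once by step()
    solution = step([0.1, 100, 1.0, 7])
    # evaluate the initial point
    solution_eval = objective(X, y, solution)
    # run the hill climb
    for i in range(n_iter):
        # take a step
        candidate = step(solution)
        # evaluate candidate point
        candidate_eval = objective(X, y, candidate)
        # check if we should keep the new point
        if candidate_eval >= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d, cfg=%s %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]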
Tying this together, the complete example of manually tuning the hyperparameters of the
XGBoost algorithm using a stochastic hill climbing algorithm is listed below.
# objective function
def objective(X, y, cfg):
# unpack config
lrate, n_tree, subsam, depth = cfg
# define model
model = XGBClassifier(learning_rate=lrate, n_estimators=n_tree, subsample=subsam, max_depth=depth, use_label_encoder=False, eval_metric="logloss")
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# calculate mean accuracy
result = mean(scores)
return result
# ... step() and hillclimbing() as described above ...
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# define the total iterations
n_iter = 200
# perform the hill climbing search
cfg, score = hillclimbing(X, y, objective, n_iter)
print('Done!')
print('cfg=[%s]: Mean Accuracy: %f' % (cfg, score))
Program 30.17: XGBoost manual hyperparameter optimization for binary classification
Running the example reports the configuration and result each time an improvement is seen
during the search. At the end of the run, the best configuration and result are reported.
Note: Your results may vary given the stochastic nature of the algorithm or
evaluation procedure, or differences in numerical precision. Consider running the
example a few times and compare the average outcome.
In this case, we can see that the best result involved using a learning rate of about 0.02, 52
trees, a subsample rate of about 50 percent, and a large depth of 53 levels. This configuration
resulted in a mean accuracy of about 87.3 percent, better than the default configuration that
achieved an accuracy of about 84.9 percent.
30.5 Further Reading
APIs
sklearn.datasets.make_classification API.
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
sklearn.metrics.accuracy_score API.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
numpy.random.rand API.
https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html
sklearn.linear_model.Perceptron API.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html
Articles
Perceptron. Wikipedia.
https://en.wikipedia.org/wiki/Perceptron
XGBoost. Wikipedia.
https://en.wikipedia.org/wiki/XGBoost
30.6 Summary
In this tutorial, you discovered how to manually optimize the hyperparameters of machine
learning algorithms. Specifically, you learned:
▷ Stochastic optimization algorithms can be used instead of grid and random search for
hyperparameter optimization.
▷ How to use a stochastic hill climbing algorithm to tune the hyperparameters of the
Perceptron algorithm.
▷ How to manually optimize the hyperparameters of the XGBoost gradient boosting
algorithm.
This is the final chapter of this book. Well done!
VII
Appendix
Getting Help
A
This is just the beginning of your journey with function optimization. As you start to work
on projects and expand your existing knowledge of the techniques, you may need help. This
appendix points out some of the best sources of help.
A.2 Textbooks
There are some good textbooks on the topic that you can use as references. A good textbook can
give you consistent explanations in great detail. For the mathematical side of optimization,
these are recommended:
▷ Stephen Boyd and Lieven Vandenberghe, Convex Optimization, Cambridge, 2004.
https://amzn.to/34mvCr1
▷ James C. Spall, Introduction to Stochastic Search and Optimization, Wiley, 2003.
https://amzn.to/34JYN7m
SpeciĄc to machine learning, many books have a chapter or two on the optimization algorithms,
such as
▷ Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
https://amzn.to/3qSk3C2
These question-and-answer sites are good places to get help; remember to search for your
question before posting in case it has been asked and answered before.
▷ Optimization tag on the Mathematics Stack Exchange.
https://math.stackexchange.com/?tags=optimization
▷ Mathematical Optimization tag on Stack Overflow.
https://stackoverflow.com/questions/tagged/mathematical-optimization
▷ Mathematics and Machine Learning on Quora.
https://www.quora.com/topic/Mathematics-and-Machine-Learning
▷ Math Subreddit.
https://www.reddit.com/r/math/
Jason Brownlee
[email protected]
How to Setup Your Python
Environment
B
It can be difficult to install a Python machine learning environment on some platforms. Python
itself must be installed first and then there are many packages to install, and it can be confusing
for beginners. In this tutorial, you will discover how to setup a Python machine learning
development environment using Anaconda. After completing this tutorial, you will have a
working Python environment to begin learning, practicing, and developing machine learning
and deep learning software. These instructions are suitable for Windows, Mac OS X, and Linux
platforms. I will demonstrate them on Windows, so you may see some Windows dialogs and file
extensions.
Let's get started.
B.1 Overview
In this tutorial, we will cover the following steps:
1. Download Anaconda
2. Install Anaconda
3. Start and Update Anaconda
4. Update scikit-learn Library
5. Install Deep Learning Libraries
2. Click "Products" from the menu and click "Individual Edition" to go to the download
page https://www.anaconda.com/products/individual-d/.
This will download the Anaconda Python package to your workstation. It will automatically
give you the installer according to your OS (Windows, Linux, or MacOS). The file is about 480
MB. You should have a file with a name like:
Anaconda3-2021.05-Windows-x86_64.exe
B.3 Install Anaconda
Installation is quick and painless. There should be no tricky questions or sticking points.
The installation should take less than 10 minutes and take a bit more than 5 GB of
space on your hard drive.
You can learn all about the Anaconda Navigator here: https://docs.continuum.io/anaconda/navigator.html.
You can use the Anaconda Navigator and graphical development environments later; for now,
I recommend starting with the Anaconda command line environment called conda. Conda is
fast and simple, it's hard for error messages to hide, and you can quickly confirm your
environment is installed and working correctly.
1. Open a terminal or CMD.exe prompt (command line window).
2. Confirm conda is installed correctly by typing:
conda -V
conda 4.10.1
Conda documentation: https://conda.pydata.org/docs/index.html
3. Confirm that Python is installed correctly by typing:
python -V
Python 3.8.8
If the commands do not work or have an error, please check the documentation for help
for your platform. See some of the resources in the "Further Reading" section.
4. Confirm your conda environment is up-to-date by typing:
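The update commands themselves are not shown in this excerpt; the standard way to bring conda and the Anaconda packages up to date is:
conda update conda
conda update anaconda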
You may need to install some packages and confirm the updates.
5. Confirm your SciPy environment.
The script below will print the version number of the key SciPy libraries you require
for machine learning development, specifically: SciPy, NumPy, Matplotlib, Pandas,
Statsmodels, and Scikit-learn. You can type "python" and type the commands in
directly. Alternatively, I recommend opening a text editor and copy-pasting the script
into your editor.
# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# ... matplotlib, pandas, statsmodels, and sklearn are checked the same way ...
Save the script as a file with the name: versions.py. On the command line, change
your directory to where you saved the script and type:
python versions.py
scipy: 1.7.1
numpy: 1.20.3
matplotlib: 3.4.2
pandas: 1.3.2
statsmodels: 0.12.2
sklearn: 0.24.2
After you have the updated libraries, you can try a scikit-learn tutorial, such as:
▷ Your First Machine Learning Project in Python Step-By-Step
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
Alternatively, you may choose to install using pip and a specific version of tensorflow
for your platform.
# keras
import keras
print('keras: %s' % keras.__version__)
# tensorflow
import tensorflow
print('tensorflow: %s' % tensorflow.__version__)
python deep_versions.py
keras: 2.4.3
tensorflow: 2.3.0
B.8 Summary
Congratulations, you now have a working Python development environment for machine learning
and deep learning. You can now learn and practice machine learning and deep learning on your
workstation.
How Far You Have Come
You made it. Well done. Take a moment and look back at how far you have come. You now
know:
▷ What is function optimization and why it is relevant and important to machine learning
▷ The trade-off in applying optimization algorithms, and the trade-off in tuning the
hyperparameters
▷ The difference between a local optimum and a global optimum
▷ How to visualize the progress and result of function optimization algorithms
▷ The stochastic nature of optimization algorithms
▷ Optimization by random search or grid search
▷ Carrying out local optimization by pattern search, quasi-Newton, least-square, and hill
climbing methods
▷ Carrying out global optimization using evolutionary algorithms and simulated annealing
▷ The difference in various gradient descent algorithms, including momentum, AdaGrad,
RMSProp, Adadelta, and Adam; and how to use them
▷ How to apply optimization to common machine learning tasks
Don't make light of this. You have come a long way in a short amount of time. You have
developed important and valuable foundational skills in function optimization. You can now
confidently:
▷ Understand the optimization algorithm in machine learning papers.
▷ Implement the optimization descriptions of machine learning algorithms.
▷ Describe the optimization operations of your machine learning models.
The sky's the limit.
Thank You!
I want to take a moment and sincerely thank you for letting me help you start your optimization
journey. I hope you keep learning and have fun as you continue to master machine learning.
Jason Brownlee
2021–2023