Optimization for Machine Learning
7-Day Mini-Course
Jason Brownlee
Disclaimer
The information contained within this eBook is strictly for educational purposes. If you
wish to apply ideas contained in this eBook, you are taking full responsibility for your
actions.
The author has made every effort to ensure that the information within this
book was correct at the time of publication. The author does not assume, and
hereby disclaims, any liability to any party for any loss, damage, or disruption
caused by errors or omissions, whether such errors or omissions result from
accident, negligence, or any other cause.
No part of this eBook may be reproduced or transmitted in any form or by any means,
electronic or mechanical, recording or by any information storage and retrieval
system, without written permission from the author.
Mini-course overview
This crash course is broken down into seven lessons. You could complete one
lesson per day (recommended) or complete all of the lessons in one day
(hardcore). It really depends on the time you have available and your level of
enthusiasm. Below is a list of the seven lessons that will get you started and
productive with optimization for machine learning in Python:
▷ Lesson 01: Why optimize?
▷ Lesson 02: Grid search.
▷ Lesson 03: Optimization algorithms in SciPy.
▷ Lesson 04: BFGS algorithm.
▷ Lesson 05: Hill-climbing algorithm.
▷ Lesson 06: Simulated annealing.
▷ Lesson 07: Gradient descent.
Each lesson could take you 60 seconds or up to 30 minutes. Take your time
and complete the lessons at your own pace. Ask questions and even share your
results online. The lessons expect you to go off and find out how to do things. I
will give you hints, but part of the point of each lesson is to force you to learn
where to look for help on the statistical methods, the NumPy API, and the
best-of-breed tools in Python. (Hint: I have all of the answers directly on this
blog; use the search box.) Share your results online; I'll cheer you on!
Lesson 01: Why optimize?

The function here does not mean you need to explicitly define a function in the
programming language. A conceptual one is sufficient. What we want to do next is
to manipulate the input and check the output until the best output is achieved.
In the case of machine learning, the best can mean:
▷ Highest accuracy, or precision, or recall
▷ Largest AUC of ROC
▷ Greatest F1 score in classification or R2 score in regression
▷ Least error, or log-loss
or something else along these lines. We can manipulate the input by random methods
such as sampling or random perturbation. We can also assume the function has
certain properties and try out a sequence of inputs to exploit these properties.
Of course, we can also check every possible input; once we have exhausted the
possibilities, we will know the best answer.
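As a small illustration of the idea of manipulating the input and checking the output, here is a minimal random-search sketch (the stand-in objective function and the bounds are my own choices, not from the course):

```python
import random

# conceptual objective: a smaller output is better
def objective(x):
    return x ** 2.0

# random search: sample inputs within a bound, keep the best output seen
random.seed(1)
best_x, best_y = None, float("inf")
for _ in range(100):
    x = random.uniform(-5.0, 5.0)   # manipulate the input at random
    y = objective(x)                # check the output
    if y < best_y:                  # remember the best found so far
        best_x, best_y = x, y
print("best input %.5f gives output %.5f" % (best_x, best_y))
```

With enough samples, the best value found drifts toward the true minimum at x = 0, although random search gives no guarantee of reaching it exactly.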
These are the basics of why we want to do optimization, what it is about,
and how we can do it. You may not notice it, but training a machine learning
model is doing optimization. You may also explicitly perform optimization to
select features or fine-tune hyperparameters. As you can see, optimization is
useful in machine learning.
Your Task
For this lesson, you must find a machine learning model and list three examples
where optimization might be used or might help in training and using the model.
These may be related to some of the reasons above, or they may be your own
personal motivations.
Next
In the next lesson, you will discover how to perform grid search on an arbitrary
function.
Lesson 02: Grid search

In this lesson, you will discover a gentle introduction to grid search for
optimization. Let's start with this function:
# objective function
def objective(x, y):
return x**2.0 + y**2.0
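The grid-search listing itself is not reproduced above; a minimal sketch that exhaustively evaluates the function over a grid (the range and step size here are my assumptions) could look like this:

```python
import numpy as np

# objective function
def objective(x, y):
    return x ** 2.0 + y ** 2.0

# define the range and resolution of the grid
r_min, r_max = -5.0, 5.0
step = 0.1
xaxis = np.arange(r_min, r_max, step)
yaxis = np.arange(r_min, r_max, step)

# evaluate the objective at every grid point and keep the best
best_eval = float("inf")
best_x = best_y = None
for x in xaxis:
    for y in yaxis:
        value = objective(x, y)
        if value < best_eval:
            best_x, best_y, best_eval = x, y, value
print("f(%.5f, %.5f) = %.5f" % (best_x, best_y, best_eval))
```

Because every grid point is checked, the best answer on the grid is guaranteed, but the cost grows quickly with the number of dimensions.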
Your Task
For this lesson, you should look up how to use the numpy.meshgrid() function and
rewrite the example code. Then you can try to replace the objective function
with f(x, y, z) = (x − y + 1)² + z², which is a function with 3D input.
Next
In the next lesson, you will learn how to use scipy to optimize a function.
Lesson 03: Optimization algorithms in SciPy
In this lesson, you will discover how you can make use of SciPy to optimize
your function. There are a lot of optimization algorithms in the literature. Each
has its strengths and weaknesses, and each is good for a different kind of
situation. Reusing the same function we introduced in the previous lesson,
f (x, y) = x2 + y2
we can make use of some predefined algorithms in SciPy to find its minimum.
Probably the easiest is the Nelder-Mead algorithm. This algorithm is based on
a series of rules to determine how to explore the surface of the function.
Without going into the detail, we can simply call SciPy and apply the
Nelder-Mead algorithm to find a function's minimum:
from scipy.optimize import minimize
from numpy.random import rand
# objective function
def objective(x):
return x[0]**2.0 + x[1]**2.0
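The call to minimize() itself is not shown above; putting the pieces together, a minimal runnable sketch (the search range for the starting point is my assumption) is:

```python
from scipy.optimize import minimize
from numpy.random import rand

# objective function, taking a single vector argument
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# define the range for the input and pick a random starting point
r_min, r_max = -5.0, 5.0
pt = r_min + rand(2) * (r_max - r_min)

# perform the search with the Nelder-Mead algorithm
result = minimize(objective, pt, method='nelder-mead')
print('Status: %s' % result['message'])
print('Total evaluations: %d' % result['nfev'])
print('Solution: f(%s) = %.5f' % (result['x'], result['fun']))
```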
In the code above, we need to write our function with a single vector
argument. Hence virtually the function becomes
f(x[0], x[1]) = (x[0])² + (x[1])²
Your Task
For this lesson, you should replace the objective function in the example code
above with the following:
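The listing is not reproduced here; the standard two-dimensional Ackley function (this implementation is my reconstruction, not the course's exact code) can be written as:

```python
from numpy import exp, sqrt, cos, pi, e

# ackley multimodal function; the global minimum is at v = [0, 0]
def objective(v):
    x, y = v
    return (-20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2)))
            - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y)))
            + e + 20)
```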
This defines the Ackley function. The global minimum is at v = [0, 0]. However,
Nelder-Mead most likely cannot find it because this function has many local
minima. Try repeating your code a few times and observe the output. You should
get a different output each time you run the program.
Next
In the next lesson, you will learn how to use the same SciPy function to apply
a different optimization algorithm.
Lesson 04: BFGS algorithm
In this lesson, you will discover how you can make use of SciPy to apply the
BFGS algorithm to optimize your function. As we have seen in the previous lesson,
we can make use of the minimize() function from scipy.optimize to optimize a
function using the Nelder-Mead algorithm. That is a simple "pattern search"
algorithm that does not need to know the derivatives of a function.
First-order derivative means differentiating the objective function once.
Similarly, the second-order derivative differentiates the first-order derivative
one more time. If we have the second-order derivative of the objective function,
we can apply Newton's method to find its optimum.
There is another class of optimization algorithms that can approximate the
second-order derivative from the first-order derivative and use the
approximation to optimize the objective function. They are called quasi-Newton
methods. BFGS is the most famous member of this class.
Revisiting the same objective function that we used in previous lessons,
f (x, y) = x2 + y2
we can tell that the first-order derivative is:

∇f = [2x, 2y]ᵀ

This is a vector of two components, because the function f(x, y) receives a
vector value of two components (x, y) and returns a scalar value.
If we create a new function for the first-order derivative, we can call SciPy
and apply the BFGS algorithm:
# objective function
def objective(x):
return x[0]**2.0 + x[1]**2.0
def derivative(x):
return [x[0] * 2, x[1] * 2]
...
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative)
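The starting point pt is elided in the snippet above; pieced together, a runnable sketch (the search range for the starting point is my assumption, and BFGS is used here to match the lesson) is:

```python
from scipy.optimize import minimize
from numpy.random import rand

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# first-order derivative of the objective function
def derivative(x):
    return [x[0] * 2, x[1] * 2]

# random starting point within assumed bounds
r_min, r_max = -5.0, 5.0
pt = r_min + rand(2) * (r_max - r_min)

# perform the search with BFGS, supplying the gradient via jac
result = minimize(objective, pt, method='BFGS', jac=derivative)
print('Solution: f(%s) = %.5f' % (result['x'], result['fun']))
```

Swapping method='BFGS' for method='L-BFGS-B' runs the memory-efficient variant, which is the comparison the task below asks you to make.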
Your Task
For this lesson, you should create a function with many more parameters (i.e.,
the vector argument to the function has many more than two components) and
observe the performance of BFGS and L-BFGS-B. Do you notice the difference
in speed? How different are the results from these two methods? What happens
if your function is not convex but has many local optima?
Next
In the next lesson, you will learn how to implement hill-climbing method.
Lesson 05: Hill-climbing algorithm

In this lesson, you will discover how to implement the hill-climbing algorithm
and use it to optimize your function.
The hillclimbing function will randomly pick an initial point within the bounds,
then test the objective function in iterations. Whenever it finds the objective
function yielding a lower value, the solution is remembered, and the next point
to test is generated from its neighborhood.
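The hillclimbing() listing itself is not shown above; a minimal reconstruction following that description (the Gaussian step generation and the names are my assumptions) is:

```python
import numpy as np

# hill climbing local search algorithm
def hillclimbing(objective, bounds, n_iterations, step_size):
    # randomly pick an initial point within the bounds
    solution = bounds[:, 0] + np.random.rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    solution_eval = objective(solution)
    for i in range(n_iterations):
        # generate a candidate from the neighborhood of the current solution
        candidate = solution + np.random.randn(len(bounds)) * step_size
        candidate_eval = objective(candidate)
        # remember the candidate if it yields a lower value
        if candidate_eval <= solution_eval:
            solution, solution_eval = candidate, candidate_eval
    return solution, solution_eval

# example: minimize f(x, y) = x^2 + y^2
def objective(x):
    return x[0]**2.0 + x[1]**2.0

np.random.seed(1)
bounds = np.asarray([[-5.0, 5.0], [-5.0, 5.0]])
best, score = hillclimbing(objective, bounds, n_iterations=1000, step_size=0.1)
print('f(%s) = %.5f' % (best, score))
```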
Your Task
For this lesson, you should provide your own objective function (for example,
copy over the one from the previous lesson), set up n_iterations and step_size,
and apply the hillclimbing function to find the minimum. Observe how the
algorithm finds a solution. Try different values of step_size and compare
the number of iterations needed to reach the proximity of the final solution.
Next
In the next lesson, you will learn how to implement simulated annealing.
Lesson 06: Simulated annealing

In this lesson, you will discover how simulated annealing works and how to
use it. For non-convex functions, the algorithms you learned in previous
lessons may easily be trapped at local optima and fail to find the global
optima. The reason is the greedy nature of the algorithms: whenever a better
solution is found, they will not let go. Hence if an even better solution
exists but not in the proximity, the algorithm will fail to find it.
Simulated annealing tries to improve on this behavior by striking a balance
between exploration and exploitation. At the beginning, when the algorithm
does not know much about the function to optimize, it prefers to explore other
solutions rather than stay with the best solution found. At a later stage, as
more solutions have been explored and the chance of finding even better
solutions diminishes, the algorithm will prefer to remain in the neighborhood
of the best solution it has found.
The following is the implementation of simulated annealing as a Python
function:
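The listing is omitted above; a common implementation following that description (the cooling schedule, the Metropolis acceptance rule, and the names are my assumptions) is:

```python
import numpy as np

# simulated annealing algorithm
def simulated_annealing(objective, bounds, n_iterations, step_size, temp):
    # random starting point within the bounds
    best = bounds[:, 0] + np.random.rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    best_eval = objective(best)
    curr, curr_eval = best, best_eval
    for i in range(n_iterations):
        # take a step from the current point
        candidate = curr + np.random.randn(len(bounds)) * step_size
        candidate_eval = objective(candidate)
        # keep track of the best solution found so far
        if candidate_eval < best_eval:
            best, best_eval = candidate, candidate_eval
        # cool the temperature over time; accept worse moves probabilistically
        diff = candidate_eval - curr_eval
        t = temp / float(i + 1)
        if diff < 0 or np.random.rand() < np.exp(-diff / t):
            curr, curr_eval = candidate, candidate_eval
    return best, best_eval

# example: minimize the convex function f(x, y) = x^2 + y^2
def objective(x):
    return x[0]**2.0 + x[1]**2.0

np.random.seed(1)
bounds = np.asarray([[-5.0, 5.0], [-5.0, 5.0]])
best, score = simulated_annealing(objective, bounds, n_iterations=1000,
                                  step_size=0.1, temp=10)
print('f(%s) = %.5f' % (best, score))
```

Early on, the temperature t is high, so even worse candidates are often accepted (exploration); as t decays, the algorithm settles near the best region found (exploitation).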
Your Task
For this lesson, you should repeat the exercise you did in the previous lesson
with the simulated annealing code above. Try the objective function
f(x, y) = x² + y², which is a convex one. Does simulated annealing or hill
climbing take fewer iterations? Then replace the objective function with the
Ackley function introduced in Lesson 03. Is the minimum found by simulated
annealing or by hill climbing smaller?
Next
In the next lesson, you will learn how to implement
gradient descent.
Lesson 07: Gradient descent
In this lesson, you will discover how you can implement the gradient descent
algorithm. Gradient descent is the algorithm used to train neural networks.
Although there are many variants, all of them are based on the gradient,
or the first-order derivative, of the function. The idea lies in the physical
meaning of a gradient of a function. If the function takes a vector and returns
a scalar value, the gradient of the function at any point tells you the
direction in which the function increases the fastest. Hence if we aim to
find the minimum of the function, the direction we should explore is the
exact opposite of the gradient.
In mathematical notation, if we are looking for the minimum of f(x), where
x is a vector, and the gradient of f(x) is denoted by ∇f(x) (which is also a
vector), then we know

x_new = x − α × ∇f(x)
will be closer to the minimum than x. Now let’s try to implement this in
Python. Reusing the sample objective function and its derivative we learned in
Lesson 04, this is the gradient descent algorithm and its use to find the
minimum of the objective function:
# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

...
# take a step
solution = solution - step_size * gradient
# evaluate candidate point
solution_eval = objective(solution)
# report progress
print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
return [solution, solution_eval]
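The full gradient_descent() function is abbreviated above; a runnable reconstruction (the bounds and hyperparameter values are my choices for illustration) is:

```python
import numpy as np

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# first-order derivative of the objective function
def derivative(x):
    return np.asarray([x[0] * 2, x[1] * 2])

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
    # random starting point within the bounds
    solution = bounds[:, 0] + np.random.rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    for i in range(n_iter):
        # compute the gradient and take a step against it
        gradient = derivative(solution)
        solution = solution - step_size * gradient
        # evaluate candidate point and report progress
        solution_eval = objective(solution)
        print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]

bounds = np.asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = gradient_descent(objective, derivative, bounds,
                               n_iter=30, step_size=0.1)
print('Done: f(%s) = %.5f' % (best, score))
```

On this convex function each step multiplies the solution by (1 − 2 × step_size), so with step_size=0.1 the progress shrinks steadily toward the minimum at the origin.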
This algorithm depends not only on the objective function but also on its
derivative. Hence it may not be suitable for all kinds of problems. This
algorithm is also sensitive to the step size: a step size that is too large
with respect to the objective function may cause the gradient descent
algorithm to fail to converge. If this happens, we will see that the progress
is not moving toward lower values.
There are several variations that make the gradient descent algorithm more
robust, for example:
▷ Add momentum into the process, so that the move not only follows the
gradient but also partially follows the average of the gradients from previous
iterations.
▷ Make the step sizes different for each component of the vector x.
▷ Make the step size adaptive to the progress.
Your Task
For this lesson, you should run the example program above with different
step_size and n_iter values and observe the difference in the progress of the
algorithm. At what step_size do you see the above program fail to converge?
Then try to add a new parameter β to the gradient_descent() function as the
momentum weight, so that the update rule becomes

x_new = x − α × ∇f(x) − β × g

where g is the average of ∇f(x) over, for example, the five previous iterations.
Do you see any improvement to this optimization? Is this a suitable example
for using momentum?
Final Word Before You Go...
You made it. Well done! Take a moment and look back at how far you have come.
You discovered:
▷ The importance of optimization in applied machine learning.
▷ How to do grid search to optimize by exhausting all possible solutions.
▷ How to use SciPy to optimize your own function.
▷ How to implement hill-climbing algorithm for optimization.
▷ How to use simulated annealing algorithm for optimization.
▷ What gradient descent is, how to use it, and some variations of this
algorithm.
This is just the beginning of your journey with optimization for machine
learning. Keep practicing and developing your skills. Take the next step and
check out my book on Optimization for Machine Learning.