ML Lecture 2


Machine Learning

(Lecture 2)

UEM/IEM Summer 2018

1
Reducing Loss: Gradient Descent
• Let's replace the green box, "Compute parameter
updates," with something more substantial.

2
Reducing Loss: Gradient Descent
• If we calculate the loss for all possible values of w1, the
resulting plot of loss vs. w1 will always be convex for the
kind of regression problems we've been examining; it looks
like the following:

Figure 1. Regression problems yield convex loss vs. weight plots.

3


4
Source: https://math.stackexchange.com/questions/2364116/how-to-remember-which-function-is-concave-and-which-one-is-convex
Reducing Loss: Gradient Descent
• Convex problems have only one minimum; that is,
only one place where the slope is exactly 0. That
minimum is where the loss function converges.
• Calculating the loss function for every conceivable
value of w1 over the entire data set would be an
inefficient way of finding the convergence point.
Let's examine a better mechanism—very popular in
machine learning—called gradient descent.

5
Reducing Loss: Gradient Descent
• The first stage in gradient descent is to pick a
starting value (a starting point) for w1.
• The starting point doesn't matter much; therefore,
many algorithms simply set w1 to 0 or pick a
random value. The following figure shows that
we've picked a starting point slightly greater than 0:

6
Figure 2. A starting point for gradient descent.
Reducing Loss: Gradient Descent
• The gradient descent algorithm then calculates the
gradient of the loss curve at the starting point.
• In Figure 2, the gradient of loss is equal to
the derivative (slope) of the curve, and tells you
which way is "warmer" or "colder."
• When there are multiple weights, the gradient is a
vector of partial derivatives with respect to the
weights.
• Note that a gradient is a vector, so it has both of
the following characteristics:
• a direction
• a magnitude
7
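Written out explicitly (the symbols below are illustrative and not taken from the slide), the gradient of a loss L over weights w1, …, wn is the vector of partial derivatives:

```latex
\nabla L(w_1, \dots, w_n) =
\left( \frac{\partial L}{\partial w_1},\ \frac{\partial L}{\partial w_2},\ \dots,\ \frac{\partial L}{\partial w_n} \right)
```

Its direction points where L increases fastest, and its magnitude ‖∇L‖ measures how fast.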
Reducing Loss: Gradient Descent
• The gradient always points in the direction of
steepest increase in the loss function.
• The gradient descent algorithm takes a step in the
direction of the negative gradient in order to
reduce loss as quickly as possible.

8
Figure 3. Gradient descent relies on negative gradients.
Reducing Loss: Gradient Descent
• To determine the next point along the loss function
curve, the gradient descent algorithm adds some
fraction of the gradient's magnitude to the starting
point as shown in the following figure:

Figure 4. A gradient step moves us to the next point on the loss curve.

• Gradient descent then repeats this process, edging ever
closer to the minimum.

9
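In symbols, one iteration of the procedure above is the update below; the symbol α for the "fraction of the gradient" is an assumption of this note (the slides introduce it as the learning rate shortly):

```latex
w_1 \;\leftarrow\; w_1 - \alpha \,\frac{\partial L}{\partial w_1}
```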
Reducing Loss: Learning Rate
• The gradient vector has both a direction and a
magnitude. Gradient descent algorithms multiply
the gradient by a scalar known as the learning
rate (also sometimes called step size) to determine
the next point.
• For example, if the gradient magnitude is 2.5 and
the learning rate is 0.01, then the gradient descent
algorithm will pick the next point 0.025 away from
the previous point.

10
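As a minimal sketch (the toy data, function names, and learning rate value are illustrative assumptions, not part of the lecture), gradient descent on a one-weight squared loss looks like this:

```python
# A minimal sketch of gradient descent on a toy one-weight squared loss.
# The data, function names, and learning rate are illustrative assumptions.

examples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # toy (x, y) pairs

def loss(w1):
    """Mean squared error of the prediction y = w1 * x."""
    return sum((w1 * x - y) ** 2 for x, y in examples) / len(examples)

def gradient(w1):
    """Derivative of loss(w1) with respect to w1."""
    return sum(2 * (w1 * x - y) * x for x, y in examples) / len(examples)

w1 = 0.0              # starting point (0 or a random value both work)
learning_rate = 0.05  # the scalar that multiplies the gradient each step

for step in range(100):
    w1 -= learning_rate * gradient(w1)  # step in the negative gradient direction

print(w1, loss(w1))   # w1 converges near 2.0 for this toy data
```

Each step moves the weight by learning_rate times the gradient, exactly as in the 0.01 × 2.5 = 0.025 example above.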
Reducing Loss: Learning Rate
• Hyperparameters are the knobs that programmers
tweak in machine learning algorithms.
• A hyperparameter is a parameter whose value is set
before the learning process begins (e.g., the learning rate).
• Most machine learning programmers spend a fair
amount of time tuning the learning rate.
• If you pick a learning rate that is too small, learning
will take too long:

11
Figure 5. Learning rate is too small.
Reducing Loss: Learning Rate
• Conversely, if you specify a learning rate that is too
large, the next point will perpetually bounce
haphazardly across the bottom of the well like a
quantum mechanics experiment gone horribly
wrong:

12
Figure 6. Learning rate is too large.
Reducing Loss: Learning Rate
• There is a Goldilocks (just-right) learning rate for
every regression problem.
• The Goldilocks value is related to how flat the loss
function is.
• If you know the gradient of the loss function is
small then you can safely try a larger learning rate,
which compensates for the small gradient and
results in a larger step size.

13
Figure 7. Learning rate is just right.
Optimizing Learning Rate
• Experiment with different learning rates and see
how they affect the number of steps required to
reach the minimum of the loss curve.

https://developers.google.com/machine-learning/crash-course/fitter/graph

14
Optimizing Learning Rate
• Set a learning rate of 0.1 on the slider. Keep hitting
the STEP button until the gradient descent
algorithm reaches the minimum point of the loss
curve. How many steps did it take?

15
Optimizing Learning Rate
• Set a learning rate of 0.1 on the slider. Keep hitting
the STEP button until the gradient descent
algorithm reaches the minimum point of the loss
curve. How many steps did it take?

81 Steps

16
Optimizing Learning Rate
• Can you reach the minimum more quickly with a
higher learning rate? Set a learning rate of 1, and
keep hitting STEP until gradient descent reaches the
minimum. How many steps did it take this time?

17
Optimizing Learning Rate
• Can you reach the minimum more quickly with a
higher learning rate? Set a learning rate of 1, and
keep hitting STEP until gradient descent reaches the
minimum. How many steps did it take this time?

6 Steps

18
Optimizing Learning Rate
• How about an even larger learning rate? Reset the
graph, set a learning rate of 4, and try to reach the
minimum of the loss curve. What happened this
time?

19
Optimizing Learning Rate
• How about an even larger learning rate? Reset the
graph, set a learning rate of 4, and try to reach the
minimum of the loss curve. What happened this
time?

Gradient descent never reaches the minimum. The steps
progressively increase in size; each step jumps back and
forth across the bowl, climbing the curve instead of
descending to the bottom.

20
Optimizing Learning Rate
• Can you find the Goldilocks learning rate for this
curve, where gradient descent reaches the
minimum point in the fewest number of steps?
What is the fewest number of steps required to
reach the minimum?

21
Optimizing Learning Rate
• Can you find the Goldilocks learning rate for this
curve, where gradient descent reaches the
minimum point in the fewest number of steps?
What is the fewest number of steps required to
reach the minimum?

Goldilocks learning rate: 1.6, reaching the minimum in just 1 step

22
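An aside that explains the one-step result (a standard calculus fact rather than something stated on the slide): for a one-dimensional quadratic loss f, the ideal learning rate is the inverse of the second derivative, and a step of exactly that size lands on the minimum:

```latex
\alpha_{\text{ideal}} = \frac{1}{f''(w_1)}
\qquad\Longrightarrow\qquad
w_1 - \alpha_{\text{ideal}}\, f'(w_1) = w^{*}
\quad \text{(the minimizer of the quadratic } f\text{)}
```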
Reducing Loss: Stochastic Gradient Descent
• In gradient descent, a batch is the total number of
examples you use to calculate the gradient in a
single iteration.
• So far, we've assumed that the batch has been the
entire data set.
• In practice, when working at Google scale, data
sets often contain billions or even hundreds of
billions of examples.
• Furthermore, Google data sets often contain huge
numbers of features. Consequently, a batch can be
enormous. A very large batch may cause even a
single iteration to take a very long time to compute.
23
Reducing Loss: Stochastic Gradient Descent
• A large data set with randomly sampled examples
probably contains redundant data.
• In fact, redundancy becomes more likely as the
batch size grows.
• Some redundancy can be useful to smooth out
noisy gradients, but enormous batches tend not to
carry much more predictive value than large
batches.

24
Reducing Loss: Stochastic Gradient Descent
• What if we could get the right gradient on average
for much less computation?
• By choosing examples at random from our data set,
we could estimate (albeit, noisily) a big average from
a much smaller one.
• Stochastic gradient descent (SGD) takes this idea to
the extreme: it uses only a single example (a batch
size of 1) per iteration. Given enough iterations, SGD
works but is very noisy. The term "stochastic"
indicates that the one example comprising each
batch is chosen at random.
25
Reducing Loss: Stochastic Gradient Descent
• Mini-batch stochastic gradient descent (mini-batch
SGD) is a compromise between full-batch iteration
and SGD.
• A mini-batch is typically between 10 and 1,000
examples, chosen at random. Mini-batch SGD
reduces the amount of noise in SGD but is still
more efficient than full-batch.
• To simplify the explanation, we focused on gradient
descent for a single feature. Rest assured that
gradient descent also works on feature sets that
contain multiple features.

26
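A minimal sketch of the batching idea on synthetic one-feature data follows; the array names, batch size, and learning rate are illustrative assumptions, not part of the lecture:

```python
import numpy as np

# Mini-batch SGD on synthetic data: each step uses a small random sample
# of examples instead of the full data set.
rng = np.random.default_rng(0)
X = rng.normal(size=10_000)                       # synthetic feature values
y = 3.0 * X + rng.normal(scale=0.1, size=10_000)  # synthetic labels

w = 0.0
learning_rate = 0.1
batch_size = 32   # batch_size = 1 gives SGD; batch_size = len(X) gives full batch

for step in range(1_000):
    idx = rng.integers(0, len(X), size=batch_size)  # examples chosen at random
    xb, yb = X[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)          # MSE gradient on the mini-batch only
    w -= learning_rate * grad                       # noisy step toward the minimum

print(w)  # close to 3.0, even though no step ever touched the full data set
```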
Quiz
• When performing gradient descent on a large data
set, which of the following batch sizes will likely be
more efficient?

1. A small batch or even a batch of one example (SGD).


2. The full batch.

27
Quiz
• When performing gradient descent on a large data
set, which of the following batch sizes will likely be
more efficient?

1. A small batch or even a batch of one example (SGD). (correct answer)

2. The full batch.

Answer: a small batch or even SGD. As noted earlier, a single
full-batch iteration over a very large data set can take a very
long time to compute, so smaller batches are likely to be more
efficient.

28
Machine Learning Libraries (Python)

29
Source: https://medium.com/activewizards-machine-learning-company/top-15-python-libraries-for-data-science-in-in-2017-ab61b4f9b4a7
First Steps with TensorFlow: Toolkit
• TensorFlow is a computational framework for
building machine learning models.
• TensorFlow provides a variety of different toolkits
that allow you to construct models at your
preferred level of abstraction.
• You can use lower-level APIs to build models by
defining a series of mathematical operations.
• Alternatively, you can use higher-level APIs (like
tf.estimator) to specify predefined architectures,
such as linear regressors or neural networks.
30
First Steps with TensorFlow: Toolkit
• The following figure shows the current hierarchy of
TensorFlow toolkits:

Figure 8. TensorFlow toolkit hierarchy.

31


First Steps with TensorFlow: Toolkit
• The following table summarizes the purposes of the
different layers:

32
First Steps with TensorFlow: Toolkit
• TensorFlow consists of the following two
components:
• a graph protocol buffer
• a runtime that executes the (distributed) graph
• These two components are analogous to Python
code and the Python interpreter.
• Just as the Python interpreter is implemented on
multiple hardware platforms to run Python code,
TensorFlow can run the graph on multiple hardware
platforms, including CPU, GPU, and TPU.
33
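A minimal sketch of the graph-plus-runtime split, assuming the TensorFlow 1.x graph-mode API (later TensorFlow versions execute eagerly by default):

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode API assumed here

# Build the graph: a protocol-buffer description of the computation.
a = tf.constant(2.0)
b = tf.constant(3.0)
total = tf.add(a, b)     # nothing is executed yet; the graph is only described

# The runtime executes the graph, and can place it on CPU, GPU, or TPU.
with tf.Session() as sess:
    print(sess.run(total))  # 5.0
```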
First Steps with TensorFlow: Toolkit
• Which API(s) should you use?
• You should use the highest level of abstraction that
solves the problem.
• The higher levels of abstraction are easier to use,
but are also (by design) less flexible.
• We recommend you start with the highest-level API
first and get everything working.
• If you need additional flexibility for some special
modeling concerns, move one level lower.
• Note that each level is built using the APIs in lower
levels, so dropping down the hierarchy should be
reasonably straightforward. 34
First Steps with TensorFlow: Toolkit
• Using tf.estimator dramatically lowers the number
of lines of code.
• tf.estimator is compatible with the scikit-learn API.
• Scikit-learn is an extremely popular open-source
ML library in Python, with over 100,000 users.
• The pseudocode for a linear classification program
implemented in tf.estimator:
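The slide's code listing did not survive extraction; the following sketch is consistent with the tf.estimator API described above (the feature columns and input functions are assumed to be defined elsewhere):

```python
import tensorflow as tf

# Set up a linear classifier; feature_columns is assumed to be defined elsewhere.
classifier = tf.estimator.LinearClassifier(feature_columns=feature_columns)

# Train the model on example data supplied by an input function.
classifier.train(input_fn=train_input_fn, steps=2000)

# Use the trained model to predict.
predictions = classifier.predict(input_fn=predict_input_fn)
```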

35
Reference
• This lecture note is based on Google's Machine Learning
Crash Course, which is released under the Creative
Commons Attribution 3.0 License.

36
