ML Lecture 2
Reducing Loss: Gradient Descent
• Let’s replace the green box, “Compute parameter updates,” with something more substantial.
Reducing Loss: Gradient Descent
• If we calculate the loss for all possible values of w1, the resulting plot of loss vs. w1 will always be convex for the kind of regression problems we've been examining. The plot looks like the following:
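• As a minimal sketch (plain Python with NumPy and made-up toy data, not from the crash course), the snippet below sweeps a range of w1 values for a one-feature model y' = w1 * x and prints the squared-error loss for each value; the printed losses trace out the bowl-shaped (convex) curve described above:

import numpy as np

# Toy data roughly following y = 2x (made-up values for illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])

# Mean squared error of the model y' = w1 * x for a sweep of w1 values.
for w1 in np.linspace(-1.0, 5.0, 13):
    loss = np.mean((w1 * x - y) ** 2)
    print(f"w1 = {w1:5.2f}   loss = {loss:8.3f}")   # smallest near w1 = 2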
Reducing Loss: Gradient Descent
• The first stage in gradient descent is to pick a
starting value (a starting point) for w1.
• The starting point doesn't matter much; therefore,
many algorithms simply set w1 to 0 or pick a
random value. The following figure shows that
we've picked a starting point slightly greater than 0:
Figure 2. A starting point for gradient descent.
Reducing Loss: Gradient Descent
• The gradient descent algorithm then calculates the
gradient of the loss curve at the starting point.
• In Figure 2, the gradient of loss is equal to
the derivative (slope) of the curve, and tells you
which way is "warmer" or "colder."
• When there are multiple weights, the gradient is a
vector of partial derivatives with respect to the
weights.
• Note that a gradient is a vector, so it has both of
the following characteristics:
• a direction
• a magnitude
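• As a minimal sketch (plain NumPy, not from the crash course), the function below computes that gradient for a one-feature linear model y' = w1 * x + b trained with squared-error loss; the result is the vector of partial derivatives with respect to w1 and b:

import numpy as np

def gradient(w1, b, x, y):
    """Gradient of the mean squared error loss for the model y' = w1 * x + b."""
    error = w1 * x + b - y            # prediction error for each example
    d_w1 = 2 * np.mean(error * x)     # partial derivative of the loss w.r.t. w1
    d_b = 2 * np.mean(error)          # partial derivative of the loss w.r.t. b
    return d_w1, d_b                  # the gradient: a direction and a magnitude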
Reducing Loss: Gradient Descent
• The gradient always points in the direction of
steepest increase in the loss function.
• The gradient descent algorithm takes a step in the
direction of the negative gradient in order to
reduce loss as quickly as possible.
Figure 3. Gradient descent relies on negative gradients.
Reducing Loss: Gradient Descent
• To determine the next point along the loss function
curve, the gradient descent algorithm adds some
fraction of the gradient's magnitude to the starting
point as shown in the following figure:
Figure 4. A gradient step moves us to the next point on the loss curve.
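• In code, that step is simply the current parameter minus the learning rate (the fraction mentioned above) times the gradient; a minimal sketch reusing the gradient function from the earlier slide (w1, b, x, and y are assumed to be defined already):

learning_rate = 0.01                   # fraction of the gradient used per step

# One gradient descent step: move in the direction of the negative gradient.
d_w1, d_b = gradient(w1, b, x, y)
w1 = w1 - learning_rate * d_w1
b = b - learning_rate * d_b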
Reducing Loss: Learning Rate
• Hyperparameters are the knobs that programmers
tweak in machine learning algorithms.
• A hyperparameter is a parameter whose value is set before the learning process begins (e.g., the learning rate).
• Most machine learning programmers spend a fair
amount of time tuning the learning rate.
• If you pick a learning rate that is too small, learning
will take too long:
Figure 5. Learning rate is too small.
Reducing Loss: Learning Rate
• Conversely, if you specify a learning rate that is too
large, the next point will perpetually bounce
haphazardly across the bottom of the well like a
quantum mechanics experiment gone horribly
wrong:
Figure 6. Learning rate is too large.
Reducing Loss: Learning Rate
• There is a Goldilocks (just-right) learning rate for
every regression problem.
• The Goldilocks value is related to how flat the loss
function is.
• If you know the gradient of the loss function is small, then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size (see the sketch after Figure 7).
Figure 7. Learning rate is just right.
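• To see the three regimes from Figures 5 to 7 concretely, the sketch below (a toy example, not from the crash course) runs gradient descent on the simple convex loss L(w) = (w - 3)^2 and counts the steps needed to get within 0.001 of the minimum for several learning rates:

def steps_to_converge(learning_rate, w=0.0, target=3.0, tol=1e-3, max_steps=10000):
    """Count gradient descent steps on L(w) = (w - target)**2; None if it diverges."""
    for step in range(1, max_steps + 1):
        grad = 2 * (w - target)            # dL/dw
        w = w - learning_rate * grad       # gradient step
        if abs(w - target) < tol:
            return step
    return None                            # never reached the minimum

for lr in (0.001, 0.1, 0.5, 1.1):
    print(lr, steps_to_converge(lr))

• With these made-up settings, 0.001 needs several thousand steps (too small), 0.5 lands on the minimum in a single step (just right for this curve), and 1.1 overshoots further on every step and never converges (too large).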
Optimizing Learning Rate
• Experiment with different learning rates and see
how they affect the number of steps required to
reach the minimum of the loss curve.
https://developers.google.com/machine-learning/crash-course/fitter/graph
Optimizing Learning Rate
• Set a learning rate of 0.1 on the slider. Keep hitting
the STEP button until the gradient descent
algorithm reaches the minimum point of the loss
curve. How many steps did it take?
Optimizing Learning Rate
• Set a learning rate of 0.1 on the slider. Keep hitting
the STEP button until the gradient descent
algorithm reaches the minimum point of the loss
curve. How many steps did it take?
81 Steps
Optimizing Learning Rate
• Can you reach the minimum more quickly with a
higher learning rate? Set a learning rate of 1, and
keep hitting STEP until gradient descent reaches the
minimum. How many steps did it take this time?
Optimizing Learning Rate
• Can you reach the minimum more quickly with a
higher learning rate? Set a learning rate of 1, and
keep hitting STEP until gradient descent reaches the
minimum. How many steps did it take this time?
6 Steps
Optimizing Learning Rate
• How about an even larger learning rate? Reset the
graph, set a learning rate of 4, and try to reach the
minimum of the loss curve. What happened this
time?
Optimizing Learning Rate
• How about an even larger learning rate? Reset the graph, set a learning rate of 4, and try to reach the minimum of the loss curve. What happened this time?
Each step overshoots the minimum, so gradient descent bounces back and forth across the loss curve and never settles at the minimum point.
Optimizing Learning Rate
• Can you find the Goldilocks learning rate for this
curve, where gradient descent reaches the
minimum point in the fewest number of steps?
What is the fewest number of steps required to
reach the minimum?
Reducing Loss: Stochastic Gradient Descent
• In gradient descent, a batch is the total number of
examples you use to calculate the gradient in a
single iteration.
• So far, we've assumed that the batch has been the
entire data set.
• For example, when working at Google scale, data
sets often contain billions or even hundreds of
billions of examples.
• Furthermore, Google data sets often contain huge
numbers of features. Consequently, a batch can be
enormous. A very large batch may cause even a
single iteration to take a very long time to compute.
Reducing Loss: Stochastic Gradient Descent
• A large data set with randomly sampled examples
probably contains redundant data.
• In fact, redundancy becomes more likely as the
batch size grows.
• Some redundancy can be useful to smooth out
noisy gradients, but enormous batches tend not to
carry much more predictive value than large
batches.
Reducing Loss: Stochastic Gradient Descent
• What if we could get the right gradient on average
for much less computation?
• By choosing examples at random from our data set,
we could estimate (albeit, noisily) a big average from
a much smaller one.
• Stochastic gradient descent (SGD) takes this idea to
the extreme: it uses only a single example (a batch
size of 1) per iteration. Given enough iterations, SGD
works but is very noisy. The term "stochastic"
indicates that the one example comprising each
batch is chosen at random.
Reducing Loss: Stochastic Gradient Descent
• Mini-batch stochastic gradient descent (mini-batch
SGD) is a compromise between full-batch iteration
and SGD.
• A mini-batch is typically between 10 and 1,000
examples, chosen at random. Mini-batch SGD
reduces the amount of noise in SGD but is still
more efficient than full-batch.
• To simplify the explanation, we focused on gradient
descent for a single feature. Rest assured that
gradient descent also works on feature sets that
contain multiple features.
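• As a minimal sketch (plain NumPy with made-up parameter values, not from the crash course), the function below fits y' = w1 * x + b with mini-batch SGD: each iteration samples a small random batch and takes one gradient step on that batch alone. Setting batch_size to 1 gives plain SGD, and setting it to len(x) recovers full-batch gradient descent.

import numpy as np

def minibatch_sgd(x, y, learning_rate=0.01, batch_size=100, num_steps=1000):
    """Fit y' = w1 * x + b using mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(0)
    w1, b = 0.0, 0.0
    for _ in range(num_steps):
        # Pick a small random batch of examples for this iteration.
        idx = rng.choice(len(x), size=batch_size, replace=False)
        error = w1 * x[idx] + b - y[idx]
        # Noisy estimate of the gradient, computed on the batch only.
        d_w1 = 2 * np.mean(error * x[idx])
        d_b = 2 * np.mean(error)
        # Step in the direction of the negative gradient.
        w1 -= learning_rate * d_w1
        b -= learning_rate * d_b
    return w1, b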
Quiz
• When performing gradient descent on a large data
set, which of the following batch sizes will likely be
more efficient?
Quiz
• When performing gradient descent on a large data set, which of the following batch sizes will likely be more efficient?
A smaller batch, such as mini-batch SGD with a few hundred random examples or even SGD with a batch size of one, will likely be more efficient than computing the gradient over the entire data set on every iteration.
Machine Learning Libraries (Python)
Source: https://medium.com/activewizards-machine-learning-company/top-15-python-libraries-for-data-science-in-in-2017-ab61b4f9b4a7
First Steps with TensorFlow: Toolkit
• TensorFlow is a computational framework for
building machine learning models.
• TensorFlow provides a variety of different toolkits
that allow you to construct models at your
preferred level of abstraction.
• You can use lower-level APIs to build models by
defining a series of mathematical operations.
• Alternatively, you can use higher-level APIs (like
tf.estimator) to specify predefined architectures,
such as linear regressors or neural networks.
First Steps with TensorFlow: Toolkit
• The following figure shows the current hierarchy of
TensorFlow toolkits:
First Steps with TensorFlow: Toolkit
• TensorFlow consists of the following two
components:
• a graph protocol buffer
• a runtime that executes the (distributed) graph
• These two components are analogous to Python
code and the Python interpreter.
• Just as the Python interpreter is implemented on
multiple hardware platforms to run Python code,
TensorFlow can run the graph on multiple hardware
platforms, including CPU, GPU, and TPU.
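• A minimal sketch of that split, written against the TensorFlow 1.x-style graph API (under TensorFlow 2 the same calls are reached through tf.compat.v1): the first half only describes operations in a graph, and the runtime then executes that graph on whatever hardware is available.

import tensorflow as tf

# Under TensorFlow 2, fall back to the 1.x graph-and-session behavior.
tf.compat.v1.disable_eager_execution()

# Build the graph: nothing is computed yet, we are only describing operations.
a = tf.constant(3.0, name="a")
b = tf.constant(4.0, name="b")
total = a + b

# Run the graph: the TensorFlow runtime executes it (on CPU, GPU, or TPU).
with tf.compat.v1.Session() as sess:
    print(sess.run(total))   # prints 7.0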
First Steps with TensorFlow: Toolkit
• Which API(s) should you use?
• You should use the highest level of abstraction that
solves the problem.
• The higher levels of abstraction are easier to use,
but are also (by design) less flexible.
• We recommend you start with the highest-level API
first and get everything working.
• If you need additional flexibility for some special
modeling concerns, move one level lower.
• Note that each level is built using the APIs in lower
levels, so dropping down the hierarchy should be
reasonably straightforward.
First Steps with TensorFlow: Toolkit
• Using tf.estimator dramatically lowers the number
of lines of code.
• tf.estimator is compatible with the scikit-learn API.
• Scikit-learn is an extremely popular open-source ML library in Python, with over 100k users.
• The pseudocode for a linear classification program
implemented in tf.estimator:
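• The crash course's example looks roughly like the following; it is pseudocode in the sense that feature_columns, train_input_fn, and predict_input_fn still have to be defined:

import tensorflow as tf

# Set up a linear classifier.
classifier = tf.estimator.LinearClassifier(feature_columns)

# Train the model on some example data.
classifier.train(input_fn=train_input_fn, steps=2000)

# Use it to predict.
predictions = classifier.predict(input_fn=predict_input_fn)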
Reference
• This lecture note is based on Google's Machine Learning Crash Course, which is licensed under the Creative Commons Attribution 3.0 License.