
ME 780

Lecture 4: Optimization for Training Deep Models

Ali Harakeh

University of Waterloo
WAVE Lab
[email protected]

June 13, 2017


Overview

1. Optimization: Introduction
2. Review Of First and Second Order Methods
3. The Difference Between Learning and Pure Optimization
4. Batch and Minibatch Algorithms
5. Challenges In Deep Model Training
6. Conclusion

Section 1: Optimization: Introduction

Introduction

Of all the many optimization problems involved in deep learning, the most difficult is neural network training.

It is quite common to invest days to months of time on hundreds of machines in order to solve even a single instance of the neural network training problem.

Because this problem is so important and so expensive, a specialized set of optimization techniques has been developed for solving it.

We will focus on one particular case of optimization: finding the parameters θ of a neural network that significantly reduce a cost function J(θ).

Section 2: Review Of First and Second Order Methods

First Order Optimization Algorithms

Optimization algorithms that use only the gradient are termed first order optimization algorithms.

An example would be gradient descent, where the update rule is:

x ← x − ε∇x f(x)

with ε the learning rate. First order algorithms are well suited to neural network training, since the target loss functions decompose into a sum over the training data.
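
As a rough illustration of this update rule, here is a minimal NumPy sketch (the toy function, hand-coded gradient, and learning rate ε = 0.1 are illustrative assumptions, not part of the lecture):

    import numpy as np

    def gradient_descent(grad_f, x0, lr=0.1, steps=100):
        """Repeatedly apply x <- x - lr * grad_f(x)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            x = x - lr * grad_f(x)
        return x

    # Toy example: minimize f(x) = ||x||^2, whose gradient is 2x.
    print(gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0]))  # approaches [0, 0]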


Second Order Optimization Algorithms

Optimization algorithms that make use of the Hessian matrix are termed second order optimization algorithms.

The Hessian matrix of a function f : R^n → R is the matrix of its second derivatives, with elements H_ij = ∂²f(x) / ∂x_i ∂x_j.

The Hessian is symmetric, H ∈ S^n, anywhere the second derivatives are continuous. That is, H_ij = H_ji: the differential operators commute.
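
To make this concrete, here is a minimal finite-difference sketch (the test function and step size eps are illustrative assumptions):

    import numpy as np

    def hessian(f, x, eps=1e-5):
        """Finite-difference estimate of H_ij = d^2 f / (dx_i dx_j) at x."""
        x = np.asarray(x, dtype=float)
        n = x.size
        H = np.zeros((n, n))
        I = np.eye(n) * eps
        for i in range(n):
            for j in range(n):
                H[i, j] = (f(x + I[i] + I[j]) - f(x + I[i])
                           - f(x + I[j]) + f(x)) / eps**2
        return H

    f = lambda x: x[0]**2 + 3 * x[0] * x[1]   # analytic Hessian: [[2, 3], [3, 0]]
    H = hessian(f, np.array([1.0, 2.0]))
    print(H)                                  # approximately [[2, 3], [3, 0]]
    print(np.allclose(H, H.T))                # True: symmetric where f is smooth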


The simplest second order optimization algorithm is Newton's Method. It has the following update rule:

x* = x^(0) − H⁻¹ ∇x f(x^(0))

where H is the Hessian of f evaluated at x^(0). Newton's method consists of applying the above equation iteratively to jump toward the minimum of the function directly.
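
A minimal sketch of one Newton step (we solve the linear system rather than explicitly inverting H; the quadratic test function is an illustrative assumption):

    import numpy as np

    def newton_step(grad_f, hess_f, x):
        """One Newton update: x <- x - H(x)^{-1} grad_f(x)."""
        return x - np.linalg.solve(hess_f(x), grad_f(x))

    # For a quadratic f(x) = 0.5 x^T A x - b^T x, a single step lands on the minimum.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    x = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
    print(np.allclose(A @ x, b))  # True: the gradient is zero at the new point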


Non-Deep Disadvantages of Second Order Optimization Algorithms

The above algorithm requires the inversion of the Hessian matrix, which confers two disadvantages:

Inversion is computationally intensive when the Hessian matrix is large.

Inversion is highly unstable when the Hessian's condition number is large.

The condition number of a matrix is the ratio of the magnitude of its largest singular value to that of its smallest.
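
A quick numeric check of this definition using NumPy (the diagonal matrix is an illustrative assumption):

    import numpy as np

    H = np.diag([1000.0, 1.0])      # curvatures differing by a factor of 1000
    print(np.linalg.svd(H)[1])      # singular values: [1000., 1.]
    print(np.linalg.cond(H))        # condition number: 1000.0 = largest / smallest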


Lipschitz Continuity

The optimization algorithms that will be discussed in this lecture can be applied to a wide variety of functions.

However, these algorithms come with no guarantees when it comes to deep learning.

In the context of deep learning, we sometimes gain some guarantees by restricting ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous derivatives.


A Lipschitz continuous function is a function f whose rate of change is bounded by a Lipschitz constant L:

∀x, ∀y: |f(x) − f(y)| ≤ L ||x − y||₂

This property is useful because it allows us to quantify our assumption that a small change in the input made by an algorithm such as gradient descent will produce a small change in the output.

Lipschitz continuity is also a fairly weak constraint, and many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications.
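
As a rough illustration, here is a Monte Carlo lower bound on L from random input pairs (the test function and sampling scheme are illustrative assumptions; sampling can only approach the true constant from below):

    import numpy as np

    def lipschitz_lower_bound(f, dim, trials=10000):
        """max |f(x) - f(y)| / ||x - y||_2 over random pairs: a lower bound on L."""
        rng = np.random.default_rng(0)
        best = 0.0
        for _ in range(trials):
            x, y = rng.normal(size=dim), rng.normal(size=dim)
            best = max(best, abs(f(x) - f(y)) / np.linalg.norm(x - y))
        return best

    # f(v) = sum(sin(v)) has true Lipschitz constant sqrt(dim) ~ 1.73 for dim = 3.
    print(lipschitz_lower_bound(lambda v: np.sin(v).sum(), dim=3))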

Section 3: The Difference Between Learning and Pure Optimization

How Learning Differs From Pure Optimization

Learning differs from pure optimization in many ways, the most prominent one being the observability of the true loss function.

In most scenarios, we care about a performance measure P that is defined on the test set. This measure is usually intractable.

We optimize P indirectly by reducing a different cost function J(θ) in the hope that doing so will improve P.

This is in contrast to pure optimization, where minimizing J(θ) is the goal itself.


The Form Of Loss Functions In Deep Learning

In the context of deep learning, the loss function can be written as an average over the training set:

J(θ) = E_{(x,y)∼p̂_data} L(f(x; θ), y)

Notice that the expectation is over the empirical distribution p̂_data defined by the training data. We would usually prefer to minimize the expectation over the data generating distribution p_data rather than over the finite training set:

J*(θ) = E_{(x,y)∼p_data} L(f(x; θ), y)


Empirical Risk Minimization

The quantity J* is referred to as the risk.

If we knew p_data, risk minimization would reduce to a standard optimization task.

Since p_data is not known, we minimize the empirical risk:

E_{(x,y)∼p̂_data} L(f(x; θ), y) = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i))

This whole process is known as empirical risk minimization. Rather than optimizing the risk directly, we optimize the empirical risk and hope that the risk decreases significantly as well.
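
A minimal sketch of this average (the toy linear model and squared-error loss are illustrative assumptions):

    import numpy as np

    def empirical_risk(loss, model, X, Y):
        """(1/m) * sum of per-example losses over the training set."""
        return np.mean([loss(model(x), y) for x, y in zip(X, Y)])

    model = lambda x: 2.0 * x                   # toy predictor f(x; theta), theta = 2
    loss = lambda y_hat, y: (y_hat - y) ** 2    # per-example loss L
    X, Y = np.array([1.0, 2.0, 3.0]), np.array([2.1, 3.9, 6.2])
    print(empirical_risk(loss, model, X, Y))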


A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts.

What is the problem with empirical risk minimization?


Empirical risk minimization is prone to overfitting: models with high enough capacity can simply memorize the training set.

Furthermore, the most effective optimization algorithms rely on gradient descent, but many useful loss functions have no useful derivatives (e.g., the 0-1 loss, whose derivative is zero or undefined everywhere).

These two problems mean that in the context of deep learning, we usually cannot use empirical risk minimization directly. We rely instead on a slightly different approach, in which the quantity that we actually optimize is even more different from the quantity that we truly want to optimize.


Surrogate Loss Functions

Instead of minimizing the empirical risk, we minimize a surrogate loss function.

A surrogate loss function acts as a proxy to the empirical risk while being "nice" enough to be optimized efficiently.

Example: the SVM hinge loss is a surrogate to the 0-1 loss for classification.
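
To see why the surrogate is "nicer", compare the two losses on a few classifier scores (a minimal sketch; labels in {-1, +1} and the hinge form max(0, 1 - y * score) are standard, but the values here are illustrative):

    import numpy as np

    zero_one = lambda score, y: float(np.sign(score) != y)  # flat almost everywhere
    hinge = lambda score, y: max(0.0, 1.0 - y * score)      # convex, subdifferentiable

    for s in [-2.0, -0.1, 0.5, 2.0]:
        print(f"score={s:5.1f}  0-1={zero_one(s, +1):.0f}  hinge={hinge(s, +1):.2f}")
    # The hinge loss decreases smoothly as the score improves,
    # giving gradient descent a useful signal where the 0-1 loss only jumps.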


Early Stopping

In contrast to standard optimization, training algorithms do not halt at a local minimum, but rather when an early stopping criterion is satisfied.

Typically, the early stopping criterion is based on the true underlying loss function, such as the 0-1 loss measured on the validation set, and is designed to halt the algorithm before overfitting occurs.

This can be roughly thought of as a way to reincorporate the true loss function into the learning process.
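
A minimal sketch of an early-stopping training loop (the callables train_one_epoch and validation_error, and the patience value, are illustrative assumptions supplied by the training harness):

    def train_with_early_stopping(train_one_epoch, validation_error,
                                  max_epochs=1000, patience=10):
        """Halt when validation error has not improved for `patience` epochs."""
        best_err, best_epoch = float("inf"), 0
        for epoch in range(max_epochs):
            train_one_epoch()
            err = validation_error()             # e.g. 0-1 loss on the validation set
            if err < best_err:
                best_err, best_epoch = err, epoch  # a checkpoint would be saved here
            elif epoch - best_epoch >= patience:
                break                            # stop before overfitting sets in
        return best_err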

Section 4: Batch and Minibatch Algorithms

Decomposing Machine Learning Loss Functions

One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples, e.g. for negative log-likelihood:

J(θ) = E_{(x,y)∼p̂_data} [−log p_model(x, y; θ)]

The gradient in this case is also an expectation over the training data:

∇θ J(θ) = E_{(x,y)∼p̂_data} [−∇θ log p_model(x, y; θ)]


Computing The Expectation

Computing this expectation exactly is very expensive: it requires evaluating the model on every example in the entire dataset.

However, we can estimate these expectations by randomly sampling a small number of examples from the dataset at every iteration.

Why does this work?


The standard error of the mean estimated from n samples is σ/√n, where σ is the standard deviation of the samples.

The √n in the denominator shows that there are less than linear returns to using more examples to estimate the gradient.

Furthermore, optimization algorithms usually converge faster if they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing exact gradients.
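
We can verify the σ/√n scaling numerically (the synthetic per-example "gradients" are an illustrative assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    grads = rng.normal(loc=1.0, scale=2.0, size=100_000)  # stand-in per-example gradients

    for n in [1, 100, 10_000]:
        estimates = [rng.choice(grads, size=n).mean() for _ in range(1000)]
        print(n, np.std(estimates))
    # 100x more samples per estimate shrinks the error by only ~10x: sigma / sqrt(n).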


Another consideration that motivates estimating the gradient from a small number of samples is the redundancy in the training set.

In the extreme case where all m training examples are identical, evaluating the gradient at a single training example returns its true value: the complexity is Θ(m) for the naive approach vs. Θ(1) for the approximation!

In practice, we are unlikely to truly encounter this worst-case situation, but we may find large numbers of examples that all make very similar contributions to the gradient.


Batch, Online, and Minibatch Algorithms

Optimization algorithms that use the entire training set to compute the gradient are called batch or deterministic gradient methods. Ones that use a single training example for that task are called stochastic or online gradient methods.

Most of the algorithms we use for deep learning fall somewhere in between! These are called minibatch or minibatch stochastic methods.

We will simply be calling them stochastic methods. Take that, optimization experts!


What Minibatch Size Should We Use?

Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.

Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.


Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.

Small batches can offer a regularizing effect. The best generalization error is often achieved with a batch size of 1 (Wilson and Martinez, 2003).


Sampling Minibatches

It is crucial that minibatches are sampled at random.

Why?


Computing an unbiased estimate of the expected gradient from a set of samples requires that those samples be independent.

Also, two subsequent gradient estimates are required to be independent of each other, so two subsequent minibatches of examples should also be independent of each other.


Sampling Minibatches: Permuting The Dataset

Be very careful when your dataset is arranged in a correlated manner (most datasets naturally are).

Example: consecutive video frames for object detection in autonomous driving.

It is an absolute necessity to shuffle the examples before selecting minibatches.


In practice, shuffling the dataset before every minibatch selection is not feasible.

We usually shuffle the dataset only once and store it on the hard drive, imposing a fixed set of possible minibatches of consecutive examples that all models trained thereafter will use.

This method is not exactly random selection, but it does not seem to have a detrimental effect on training. Failing to shuffle the dataset at all, however, will seriously reduce the effectiveness of your algorithm.
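
A minimal sketch of this shuffle-once scheme (the toy arrays and batch size are illustrative assumptions):

    import numpy as np

    def minibatches(X, Y, batch_size, seed=0):
        """Shuffle once up front, then yield fixed consecutive minibatches."""
        perm = np.random.default_rng(seed).permutation(len(X))
        X, Y = X[perm], Y[perm]   # the one-time shuffle, stored on disk in practice
        for i in range(0, len(X), batch_size):
            yield X[i:i + batch_size], Y[i:i + batch_size]

    X, Y = np.arange(10.0), np.arange(10.0) ** 2
    for xb, yb in minibatches(X, Y, batch_size=4):
        print(xb, yb)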

Section 5: Challenges In Deep Model Training

Optimization Is a Hard Problem

General optimization by itself is an extremely difficult task.

Traditionally, machine learning has avoided this difficulty by carefully designing the objective function and constraints to ensure the problem is convex.

Convex optimization is not without complications, but it is not what we face with deep models: the problems we usually encounter in that context are non-convex.

This section summarizes the most prominent complications we face in deep model training.


Ill-Conditioning

Ill-conditioning of the Hessian matrix is a prominent problem in most numerical optimization problems, convex or otherwise.

Ill-conditioning manifests in SGD by causing the algorithm to get "stuck", in the sense that even very small steps increase the cost function.

Even if the algorithm doesn't get stuck, learning will proceed very slowly when the Hessian matrix has a large condition number.

Proof or intuition?


In multiple dimensions, there is a different second derivative for each direction at a single point.

The condition number of the Hessian at this point measures how much the second derivatives differ from each other.

When the Hessian has a large condition number, gradient descent performs poorly. This is because in one direction the derivative increases rapidly, while in another direction it increases slowly.

Gradient descent is unaware of this change in the derivative, so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer.
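
The effect is easy to reproduce on a quadratic bowl with condition number 100 (a minimal sketch; the learning rate must stay below 2/λ_max = 0.02 for stability in the steep direction, so the flat direction is forced to crawl):

    import numpy as np

    H = np.diag([100.0, 1.0])      # Hessian with condition number 100
    x = np.array([1.0, 1.0])
    lr = 0.009                     # safe for the steep direction, tiny for the flat one

    for _ in range(200):
        x = x - lr * (H @ x)       # gradient of f(x) = 0.5 x^T H x is H x

    print(x)  # steep coordinate ~0 almost immediately; flat coordinate barely moved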

[Figure slide: Ill-Conditioning (illustration not preserved in this text version)]

Local Minima

When optimizing a convex function, we know that we have reached a good solution if we find a critical point of any kind. With non-convex functions, such as neural nets, that is not the case.

The functions involved in deep models are guaranteed to have an extremely large number of local minima.

However, we will see that local minima are not necessarily a major problem.


Local Minima: The Model Identifiability Problem

Any model with multiple equivalently parametrized latent variables will have multiple local minima because of the model identifiability problem.

A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model's parameters.

Are deep models identifiable?


The answer is no. In fact, if we have m layers with n hidden units each, there are n!^m ways of arranging the hidden units to obtain equivalent models. (Within each layer, any of the n! permutations of its hidden units leaves the network's function unchanged, and the m layers can be permuted independently.)

This is called weight space symmetry.


There are many additional causes of non-identifiability in deep models.

For example, in any ReLU or maxout network, we can scale all of a unit's incoming weights and biases by 1/α and then scale its outgoing weights by α to achieve the same final output.

If the cost function does not include regularization terms that depend directly on the weights (such as a norm penalty), every local minimum of a ReLU or maxout network lies on an (m × n)-dimensional hyperbola of equivalent local minima.
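
This symmetry is easy to check numerically for a small ReLU network (a minimal sketch; the random two-layer network and α = 7.3 are illustrative assumptions):

    import numpy as np

    relu = lambda z: np.maximum(0.0, z)
    net = lambda x, W1, b1, w2: w2 @ relu(W1 @ x + b1)

    rng = np.random.default_rng(0)
    W1, b1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4)
    x, alpha = rng.normal(size=3), 7.3

    # Incoming weights and biases scaled by 1/alpha, outgoing weights by alpha:
    rescaled = net(x, W1 / alpha, b1 / alpha, w2 * alpha)
    print(np.allclose(net(x, W1, b1, w2), rescaled))  # True: identical function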


Model identifiability issues mean that there can be an extremely large or even uncountably infinite number of local minima in the cost function of a deep model.

However, these local minima are all equivalent in value and are not a problematic form of non-convexity.

Local minima are only truly problematic if they have a much higher cost than the global minimum.


Local Minima: How Problematic Are They?

It remains an open question whether there are many local minima of high cost for networks of practical interest and whether optimization algorithms encounter them.

Experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value.

Furthermore, it is not important to find a true global minimum: it suffices to find a point in parameter space that has low but not minimal cost (Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska et al., 2014).


Local Minima: Conclusion

Many practitioners attribute nearly all difficulty with neural network optimization to local minima. Please test specific cases before arriving at that conclusion.

A simple test to try is the gradient norm test: if the gradient norm does not shrink over time, the problem is not a local minima problem.

If it does shrink, the test is inconclusive. Why? (The norm also shrinks near saddle points and in flat regions, as discussed next.)
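
A minimal sketch of this diagnostic, assuming gradient vectors were logged during training (the 10x shrinkage threshold is an arbitrary illustrative choice):

    import numpy as np

    def gradient_norm_shrank(grad_history, factor=10.0):
        """Compare early vs. late average gradient norms over training.
        False -> norms did not shrink: the problem is not local minima.
        True  -> inconclusive: minima, saddles, and flat regions all shrink the norm."""
        norms = [np.linalg.norm(g) for g in grad_history]
        quarter = max(1, len(norms) // 4)
        early, late = np.mean(norms[:quarter]), np.mean(norms[-quarter:])
        return late < early / factor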

[Figure slide: Local Minima: Conclusion (illustration not preserved in this text version)]

Saddle Points

For many high-dimensional non-convex functions, local minima and maxima are in fact rare compared to saddle points.

A saddle point is a point where the Hessian matrix has both positive and negative eigenvalues.

We can think of a saddle point as being a local minimum along one cross-section of the cost function and a local maximum along another cross-section.
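
A tiny numeric illustration of this definition (the classic saddle f(x, y) = x² − y² is an illustrative choice):

    import numpy as np

    # f(x, y) = x^2 - y^2 has a critical point at the origin.
    H = np.array([[2.0, 0.0],
                  [0.0, -2.0]])                 # Hessian at (0, 0)
    w = np.linalg.eigvalsh(H)
    print(w)                                    # [-2.  2.]
    print((w > 0).any() and (w < 0).any())      # True: mixed signs -> saddle point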


Many classes of random functions exhibit the following behaviour:

In lower dimensional spaces, local minima and maxima are common.

In higher dimensional spaces, local minima and maxima are rare, and saddle points are much more common. For a function f : R^n → R, the expected ratio of the number of saddle points to local minima grows exponentially with n.

What is the intuition behind this phenomenon? (A local minimum requires all n Hessian eigenvalues to be positive; if each sign at a random critical point behaves roughly like a coin flip, that becomes exponentially unlikely as n grows.)


Another amazing phenomenon that occurs in many random functions is that the eigenvalues of the Hessian become more likely to be positive as we reach regions of low cost.

This means that local minima are much more likely to have a low cost than a high cost.

Also, critical points with high cost are much more likely to be saddle points, and those with extremely high cost are much more likely to be local maxima.


Plateaus, Saddle Points, and Other Flat Regions

Baldi and Hornik (1989) showed theoretically that shallow autoencoders with no non-linearities have global minima and saddle points, but no local minima with a cost value greater than the global minimum.

Saxe et al. (2013) provided exact solutions to the complete learning dynamics in linear networks and showed that learning in these models captures many of the qualitative features observed in the training of deep models with non-linear activation functions.

Dauphin et al. (2014) showed experimentally that real neural networks also have loss functions that contain very many high-cost saddle points.


Saddle Points

What are the implications of the proliferation of saddle points in the cost functions of deep models?

Saddle Points: The Effect On First Order Optimization Algorithms

For first-order optimization algorithms that use only gradient information, the situation is unclear.

The gradient can often become very small near a saddle point. On the other hand, gradient descent empirically seems to be able to escape saddle points in many cases.

Goodfellow et al. (2015) showed empirically that the gradient descent trajectory rapidly escapes saddle points.

They also argued that continuous-time gradient descent may be shown analytically to be repelled from, rather than attracted to, a nearby saddle point, but the situation may be different for more realistic uses of gradient descent.

[Figure slide: Saddle Points: The Effect On First Order Optimization Algorithms (illustration not preserved in this text version)]

Saddle Points: The Effect On Second Order Optimization Algorithms

For Newton's Method, saddle points constitute a major problem. This is because, unlike gradient descent, which is designed to move downhill, Newton's method actively seeks solutions at critical points where the gradient is zero.

The proliferation of saddle points in high dimensional spaces explains why second order methods have failed to replace gradient descent for deep learning.

Dauphin et al. (2014) introduced a saddle-free Newton method for second order optimization.

Second-order methods remain difficult to scale to large neural networks, but this saddle-free approach holds promise if it can be scaled.


Cliffs And Exploding Gradients

Neural networks with many layers often have extremely steep regions resembling cliffs.


Cliffs result from the multiplication of several large weights together.

On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off the cliff structure altogether.

Cliffs are dangerous whether approached from above or from below.

Solutions?


Cliffs And Exploding Gradients: Gradient Clipping

Gradients do not specify the optimal step size, only the optimal direction within an infinitesimal region.

When the traditional gradient descent algorithm proposes to make a very large step, the gradient clipping heuristic intervenes to reduce the step size, making it small enough that it is less likely to go outside the region where the gradient indicates the direction of approximately steepest descent.
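
A minimal sketch of norm-based clipping, one common form of this heuristic (the threshold is an illustrative assumption):

    import numpy as np

    def clip_by_norm(grad, threshold):
        """Rescale grad so its L2 norm never exceeds threshold; direction is kept."""
        norm = np.linalg.norm(grad)
        return grad * (threshold / norm) if norm > threshold else grad

    g = np.array([30.0, -40.0])                # norm 50, e.g. computed near a cliff
    print(clip_by_norm(g, threshold=5.0))      # [ 3. -4.]: same direction, norm 5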


Additional Problems Faced In Neural Network Optimization

Long Term Dependencies: arise when the computational graph is very deep. The result of this problem is vanishing and exploding gradients, as illustrated below.

Inexact Gradients: in practice, we usually have only a noisy or even biased estimate of the gradient and the Hessian. Sometimes, gradients of our loss functions are even intractable. This is not a big issue in neural network training: surrogate loss functions tend to perform well enough in practice.
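
The vanishing/exploding effect can be seen in a few lines: repeated multiplication by the same matrix, as happens through a very deep computational graph, scales a vector like the depth-th power of the matrix's largest eigenvalue (the 2x2 matrices and depth 50 are illustrative assumptions):

    import numpy as np

    for scale in [0.9, 1.1]:
        W = scale * np.eye(2)      # stand-in for a repeated linear factor in the graph
        g = np.ones(2)
        for _ in range(50):        # depth-50 chain of multiplications
            g = W @ g
        print(scale, np.linalg.norm(g))   # 0.9 -> ~0.007 (vanishes); 1.1 -> ~166 (explodes)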


Poor Correspondence Between Local and Global Structure: it is possible to overcome all of the above problems at a single point and still perform poorly if the direction that results in the most improvement locally does not point toward distant regions of much lower cost.

Initialization is really important!

Section 6: Conclusion

Optimizing cost functions for deep networks is really hard.

We almost never arrive at a global minimum; our goal is to reduce the generalization error rather than the cost function itself.

Theoretical analysis of whether an optimization algorithm can accomplish this goal is extremely difficult. Developing more realistic bounds on the performance of optimization algorithms therefore remains an important goal for machine learning research.

Also, initialization is very important for the stable performance of our optimization algorithms.


Next Lecture

Parameter Initialization Strategies.

SGD Variants: SGD, Momentum, Nesterov Momentum, AdaGrad, RMSProp, Adam.

Tentative: Newton's Method, Conjugate Gradients, the Broyden-Fletcher-Goldfarb-Shanno Algorithm (BFGS), Limited Memory BFGS (L-BFGS).

Batch Norm, Coordinate Descent, Polyak Averaging, Greedy Supervised Pretraining.

Curriculum Learning.
