Unit 2

Training Neural Networks:


Risk Minimization:
The goal of a machine learning algorithm is to reduce the expected
generalization error

J*(θ) = E_{(x, y) ~ p_data} [ L(f(x; θ), y) ],

where L is the per-example loss function, f(x; θ) is the predicted output
when the input is x, and y is the target output. This quantity is known as
the risk. We emphasize here that the expectation is taken over the true
underlying data-generating distribution p_data. If we knew the true
distribution p_data(x, y), risk minimization would be an optimization task
solvable by an optimization algorithm. However, when we do not know
p_data(x, y) but only have a training set of samples, we have a machine
learning problem.

The simplest way to convert a machine learning problem back into an
optimization problem is to minimize the expected loss on the training
set. This means replacing the true distribution p(x, y) with the empirical
distribution p̂(x, y) defined by the training set. We now minimize the
empirical risk

E_{(x, y) ~ p̂_data} [ L(f(x; θ), y) ] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)),

where m is the number of training examples.


The training process based on minimizing this average training error is
known as empirical risk minimization. In this setting, machine learning
is still very similar to straightforward optimization. Rather than
optimizing the risk directly, we optimize the empirical risk, and hope
that the risk decreases significantly as well. A variety of theoretical
results establish conditions under which the true risk can be expected
to decrease by various amounts.
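
As a small illustration of empirical risk minimization, the sketch below
simply averages a per-example loss over the m training examples; the linear
model, squared-error loss, and synthetic data are illustrative choices, not
anything prescribed by the text.

import numpy as np

def empirical_risk(params, X, y, predict, loss):
    # Average per-example loss over the m training examples.
    # predict(params, x) and loss(y_hat, y) are placeholders for whatever
    # model and loss function are being used.
    m = len(X)
    return sum(loss(predict(params, X[i]), y[i]) for i in range(m)) / m

# Illustrative choices: a linear model with squared-error loss.
predict = lambda w, x: np.dot(w, x)
loss = lambda y_hat, y: 0.5 * (y_hat - y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
print(empirical_risk(w, X, y, predict, loss))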

Loss Function:

Surrogate Loss Functions and Early Stopping:

Sometimes, the loss function we actually care about (say classification
error) is not one that can be optimized efficiently. For example, exactly
minimizing expected 0-1 loss is typically intractable (exponential in the
input dimension), even for a linear classifier (Marcotte and Savard,
1992). In such situations, one typically optimizes a surrogate loss
function instead, which acts as a proxy but has advantages. For
example, the negative log-likelihood of the correct class is typically used
as a surrogate for the 0-1 loss. The negative log-likelihood allows the
model to estimate the conditional probability of the classes, given the
input, and if the model can do that well, then it can pick the classes that
yield the least classification error in expectation.
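
A small numerical illustration of the surrogate idea, with made-up scores
and labels: the 0-1 loss only counts mistakes, while the negative
log-likelihood of the correct class is a smooth quantity that gradients can
reduce.

import numpy as np

def zero_one_loss(logits, labels):
    # Fraction of examples whose highest-scoring class is wrong.
    return np.mean(np.argmax(logits, axis=1) != labels)

def nll_surrogate(logits, labels):
    # Negative log-likelihood of the correct class under a softmax model:
    # a smooth, differentiable proxy for the 0-1 loss.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 1.1]])  # hypothetical scores
labels = np.array([0, 1, 1])

print(zero_one_loss(logits, labels))   # 1/3: the last example is misclassified
print(nll_surrogate(logits, labels))   # smooth value that training can reduce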

In some cases, a surrogate loss function actually results in being able to
learn more. For example, the test set 0-1 loss often continues to
decrease for a long time after the training set 0-1 loss has reached zero,
when training using the log-likelihood surrogate. This is because even
when the expected 0-1 loss is zero, one can improve the robustness of
the classifier by further pushing the classes apart from each other,
obtaining a more confident and reliable classifier, thus extracting more
information from the training data than would have been possible by
simply minimizing the average 0-1 loss on the training set.

A very
important difference between optimization in general and optimization
as we use it for training algorithms is that training algorithms do not
usually halt at a local minimum. Instead, a machine learning algorithm
usually minimizes a surrogate loss function but halts when a
convergence criterion based on early stopping (section 7.8) is satisfied.
Typically the early stopping criterion is based on the true underlying
loss function, such as 0-1 loss measured on a validation set, and is
designed to cause the algorithm to halt whenever overfitting begins to
occur. Training often halts while the surrogate loss function still has
large derivatives, which is very different from the pure optimization
setting, where an optimization algorithm is considered to have
converged when the gradient becomes very small.
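
The early-stopping behaviour described above can be sketched roughly as
follows; train_one_epoch and validation_error are hypothetical placeholders
for minimizing the surrogate loss and measuring the true loss of interest
(such as 0-1 loss on a validation set).

import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=5, max_epochs=200):
    # Halt when validation error has not improved for `patience` epochs.
    # train_one_epoch(model) minimizes the surrogate loss (e.g. NLL);
    # validation_error(model) measures the true underlying loss.
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        error = validation_error(model)
        if error < best_error:
            best_error, best_model = error, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # overfitting has likely begun; stop training
    return best_model, best_error
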
Model Selection
In a polynomial curve-fitting model, the order of the polynomial controls
the number of free parameters in
the model and thereby governs the model complexity. With regularized
least squares, the regularization coefficient λ also controls the effective
complexity of the model, whereas for more complex models, such as
mixture distributions or neural networks there may be multiple
parameters governing complexity. In a practical application, we need to
determine the values of such parameters, and the principal objective in
doing so is usually to achieve the best predictive performance on new
data.

Furthermore, as well as finding the appropriate values for complexity
parameters within a given model, we may wish to consider a range of
different types of model in order to find the best one for our particular
application.

We have already seen that, in the maximum likelihood approach, the
performance on the training set is not a good indicator of predictive
performance on unseen data due to the problem of over-fitting. If data
is plentiful, then one approach is simply to use some of the available
data to train a range of models, or a given model with a range of values
for its complexity parameters, and then to compare them on
independent data, sometimes called a validation set, and select the one
having the best predictive performance.

If the model design is iterated many times using a limited size data set,
then some over-fitting to the validation data can occur and so it may be
necessary to keep aside a third test set on which the performance of
the selected model is finally evaluated.
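
A minimal sketch of this protocol, assuming a hypothetical fit/error
interface: each candidate complexity setting is trained on the training
split, the setting with the lowest validation error is selected, and only
that final model is scored on the held-out test set.

import numpy as np

def select_model(candidates, fit, error, X, y, rng=np.random.default_rng(0)):
    # candidates: list of hyperparameter settings (e.g. polynomial orders).
    # fit(setting, X_train, y_train) -> model and error(model, X, y) -> float
    # are placeholders for whatever model family is being compared.
    n = len(X)
    idx = rng.permutation(n)
    train, valid, test = idx[: n // 2], idx[n // 2 : 3 * n // 4], idx[3 * n // 4 :]

    best = None
    for setting in candidates:
        model = fit(setting, X[train], y[train])
        val_err = error(model, X[valid], y[valid])
        if best is None or val_err < best[0]:
            best = (val_err, setting, model)

    val_err, setting, model = best
    test_err = error(model, X[test], y[test])  # evaluated once, at the very end
    return setting, val_err, test_err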

In many applications, however, the supply of data for training and
testing will be limited, and in order to build good models, we wish to
use as much of the available data as possible for training. However, if
the validation set is small, it will give a relatively noisy estimate of
predictive performance. One solution to this dilemma is to use cross-
validation, which is illustrated in Figure 1.18.

In S-fold cross-validation, the available data is partitioned into S groups,
and each group is held out in turn while the model is trained on the
remaining S − 1 groups. This allows a proportion (S − 1)/S of the available
data to be used for training while making use of all of the data to assess
performance.
When data is particularly scarce, it may be appropriate to consider the
case S = N, where N is the total number of data points, which gives the
leave-one-out technique.
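
A rough sketch of S-fold cross-validation under the same hypothetical
fit/error interface; choosing S = N reduces it to leave-one-out.

import numpy as np

def cross_validation_error(setting, fit, error, X, y, S):
    # Average held-out error over S folds; each fold trains on (S-1)/S of the data.
    N = len(X)
    folds = np.array_split(np.arange(N), S)   # S = N gives leave-one-out
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(N), held_out)
        model = fit(setting, X[train], y[train])
        errors.append(error(model, X[held_out], y[held_out]))
    return float(np.mean(errors))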

One major drawback of cross-validation is that the number of training
runs that must be performed is increased by a factor of S, and this can
prove problematic for models in which the training is itself
computationally expensive. A further problem with techniques such as
cross-validation that use separate data to assess performance is that
we might have multiple complexity parameters for a single model (for
instance, there might be several regularization parameters). Exploring
combinations of settings for such parameters could, in the worst case,
require a number of training runs that is exponential in the number of
parameters.

Clearly, we need a better approach. Ideally, this should rely only on the
training data and should allow multiple hyperparameters and model
types to be compared in a single training run. We therefore need to
find a measure of performance which depends only on the training data
and which does not suffer from bias due to over-fitting.

Optimization:
Machine learning algorithms usually require a high amount of numerical
computation. This typically refers to algorithms that solve mathematical
problems by methods that update estimates of the solution via an
iterative process, rather than analytically deriving a formula providing a
symbolic expression for the correct solution. Common operations
include optimization (finding the value of an argument that minimizes or
maximizes a function) and solving systems of linear equations.

Gradient-Based Optimization:
Most deep learning algorithms involve optimization of some
sort. Optimization refers to the task of either minimizing or maximizing
some function f(x) by altering x. We usually phrase most optimization
problems in terms of minimizing f(x). Maximization may be accomplished
via a minimization algorithm by minimizing −f(x). The function we want
to minimize or maximize is called the objective function or criterion.
When we are minimizing it, we may also call it the cost function, loss
function, or error function.
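
As a minimal illustration (the quadratic function, step size, and starting
point are arbitrary choices), gradient descent minimizes f by repeatedly
stepping against the gradient; maximizing f is the same as minimizing −f.

import numpy as np

def gradient_descent(grad_f, x0, learning_rate=0.1, steps=100):
    # Repeatedly step in the direction of the negative gradient.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - learning_rate * grad_f(x)
    return x

# f(x) = 0.5 * ||x||^2 has gradient x and its minimum at the origin.
x_min = gradient_descent(lambda x: x, x0=[3.0, -2.0])
print(x_min)  # close to [0, 0]
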
Difficulty of Training Deep Neural Networks
Challenges Motivating Deep Learning:
Simple machine learning algorithms have not succeeded in solving
the central problems in AI, such as recognizing speech or recognizing
objects. The development of deep learning was motivated in part by the
failure of traditional algorithms to generalize well on such AI tasks.

This section is about how the challenge of generalizing to new examples
becomes exponentially more difficult when working with high-dimensional
data, and how the mechanisms used to achieve generalization in traditional
machine learning are insufficient to learn complicated functions in
high-dimensional spaces. Such spaces also often impose high computational
costs. Deep learning was designed to overcome these and other obstacles.

The Curse of Dimensionality:

Many machine learning problems become exceedingly difficult when the
number of dimensions in the data is high. This phenomenon is known as
the curse of dimensionality. Of particular concern is that the number of
possible distinct configurations of a set of variables increases
exponentially as the number of variables increases.
The curse of dimensionality arises in many places in computer science,
and especially so in machine learning. One challenge posed by the curse
of dimensionality is a statistical challenge. As illustrated in figure 5.9, a
statistical challenge arises because the number of possible
configurations of x is much larger than the number of training
examples. To understand the issue, let us consider that the input space
is organized into a grid, like in the figure. We can describe low-
dimensional space with a low number of grid cells that are mostly
occupied by the data. When generalizing to a new data point, we can
usually tell what to do simply by inspecting the training examples that
lie in the same cell as the new input. For example, if estimating the
probability density at some point x, we can just return the number of
training examples in the same unit volume cell as x, divided by the total
number of training examples. If we wish to classify an example, we can
return the most common class of training examples in the same cell. If
we are doing regression we can average the target values observed
over the examples in that cell. But what about the cells for which we
have seen no example? Because in high-dimensional spaces the
number of configurations is huge, much larger than our number of
examples, a typical grid cell has no training example associated with it.
How could we possibly say something meaningful about these new
configurations? Many traditional machine learning algorithms simply
assume that the output at a new point should be approximately the
same as the output at the nearest training point.
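
A small numerical illustration of this statistical challenge (the sample
size and grid resolution are arbitrary): with a fixed number of examples,
the fraction of grid cells that contain any training data collapses as the
dimension grows.

import numpy as np

rng = np.random.default_rng(0)
m = 1000            # training examples
bins_per_dim = 10   # grid resolution along each axis

for d in (1, 2, 3, 5, 10):
    X = rng.uniform(size=(m, d))
    cells = set(map(tuple, np.floor(X * bins_per_dim).astype(int)))
    total = bins_per_dim ** d
    print(f"d={d:2d}: {len(cells)} of {total} cells occupied "
          f"({len(cells) / total:.2%})")
# Already at d=5 almost every cell is empty, so a nearest-cell rule has
# nothing to say about most new inputs.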

Local Constancy and Smoothness Regularization


In order to generalize well, machine learning algorithms need to be
guided by prior beliefs about what kind of function they should learn.
Previously, we have seen these priors incorporated as explicit beliefs in
the form of probability distributions over parameters of the model.
More informally, we may also discuss prior beliefs as directly
influencing the function itself and only indirectly acting on the
parameters via their effect on the function. Additionally, we informally
discuss prior beliefs as being expressed implicitly, by choosing
algorithms that are biased toward choosing some class of functions
over another, even though these biases may not be expressed (or even
possible to express) in terms of a probability distribution representing
our degree of belief in various functions. Among the most widely used
of these implicit “priors” is the smoothness prior or local constancy
prior. This prior states that the function we learn should not change
very much within a small region. Many simpler algorithms rely
exclusively on this prior to generalize well, and as a result they fail to
scale to the statistical challenges involved in solving AI level tasks.

While the k-nearest neighbors algorithm copies the output from nearby
training examples, most kernel machines interpolate between training
set outputs associated with nearby training examples.
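
A minimal 1-nearest-neighbour sketch (with made-up data) shows the local
constancy prior taken literally: a new point simply receives the output of
its closest training example.

import numpy as np

def one_nearest_neighbor(X_train, y_train, x):
    # Copy the output of the closest training example (local constancy).
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])
print(one_nearest_neighbor(X_train, y_train, np.array([0.2, 0.1])))  # -> 0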

Decision trees also suffer from the limitations of exclusively
smoothness-based learning because they break the input space into as
many regions as there are leaves and use a separate parameter (or
sometimes many parameters for extensions of decision trees) in each
region. If the target function requires a tree with at least n leaves to be
represented accurately, then at least n training examples are required
to fit the tree. A multiple of n is needed to achieve some level of
statistical confidence in the predicted output.

Other approaches to machine learning often make stronger, task-specific
assumptions. For example, we could easily solve a checkerboard-patterned
prediction task by providing the assumption that the target function
is periodic. Usually we do not include such strong, task-specific
assumptions into neural networks so that they can generalize to a much
wider variety of structures. AI tasks have structure that is much too
complex to be limited to simple, manually specified properties such as
periodicity, so we want learning algorithms that embody more general-
purpose assumptions. The core idea in deep learning is that we assume
that the data was generated by the composition of factors or features,
potentially at multiple levels in a hierarchy. Many other similarly
generic assumptions can further improve deep learning algorithms.
Manifold Learning
An important concept underlying many ideas in machine learning is
that of a manifold.

A manifold is a connected region. Mathematically, it is a set of points,
associated with a neighborhood around each point. From any given
point, the manifold locally appears to be a Euclidean space. In everyday
life, we experience the surface of the world as a 2-D plane, but it is in
fact a spherical manifold in 3-D space.

The definition of a neighborhood surrounding each point implies the
existence of transformations that can be applied to move on the
manifold from one position to a neighboring one. In the example of the
world’s surface as a manifold, one can walk north, south, east, or west.

Although there is a formal mathematical meaning to the term
“manifold,” in machine learning it tends to be used more loosely to
designate a connected set of points that can be approximated well by
considering only a small number of degrees of freedom, or dimensions,
embedded in a higher-dimensional space. Each dimension corresponds
to a local direction of variation.

In the context of machine learning, we allow the dimensionality of the
manifold to vary from one point to another. This often happens when a
manifold intersects itself.

The assumption that the data lies along a low-dimensional manifold
may not always be correct or useful. We argue that in the context of AI
tasks, such as those that involve processing images, sounds, or text, the
manifold assumption is at least approximately correct. The evidence in
favor of this assumption consists of two categories of observations.
The first observation in favor of the manifold hypothesis is that the
probability distribution over images, text strings, and sounds that occur
in real life is highly concentrated.

The second argument in favor of the manifold hypothesis is that we can
also imagine such neighborhoods and transformations, at least
informally. In the case of images, we can certainly think of many
possible transformations that allow us to trace out a manifold in image
space: we can gradually dim or brighten the lights, gradually move or
rotate objects in the image, gradually alter the colors on the surfaces of
objects, etc.
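
A tiny numerical illustration of the idea (the circle is an arbitrary
choice): points of the form (cos t, sin t) live in 2-D space but occupy only
a 1-D manifold, and the single degree of freedom t is the transformation
that moves along it.

import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 200)              # one degree of freedom
points = np.stack([np.cos(t), np.sin(t)], axis=1)   # embedded in 2-D space

# Every point is 2-dimensional, yet the data occupy a 1-D manifold:
# small changes in t move you to neighbouring points on the circle.
print(points.shape)                                       # (200, 2)
print(np.allclose(np.linalg.norm(points, axis=1), 1.0))   # all on the unit circle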

Greedy Layer-Wise Training:

Sometimes, directly training a model to solve a specific task can be too
ambitious if the model is complex and hard to optimize or if the task is
very difficult. It is sometimes more effective to train a simpler model to
solve the task, then make the model more complex. It can also be more
effective to train the model to solve a simpler task, then move on to
confront the final task. These strategies that involve training simple
models on simple tasks before confronting the challenge of training the
desired model to perform the desired task are collectively known as
pretraining.

Greedy algorithms break a problem into many components, then solve
for the optimal version of each component in isolation. Unfortunately,
combining the individually optimal components is not guaranteed to
yield an optimal complete solution. However, greedy algorithms can be
computationally much cheaper than algorithms that solve for the best
joint solution, and the quality of a greedy solution is often acceptable if
not optimal. Greedy algorithms may also be followed by a fine-tuning
stage in which a joint optimization algorithm searches for an optimal
solution to the full problem. Initializing the joint optimization algorithm
with a greedy solution can greatly speed it up and improve the quality
of the solution it finds.

Pretraining, and especially greedy pretraining, algorithms are ubiquitous
in deep learning. Of particular interest here are pretraining algorithms
that break supervised learning problems into other, simpler supervised
learning problems. This approach is known as greedy supervised pretraining.

Why would greedy supervised pretraining help? One hypothesis is that it
helps to provide better guidance to the intermediate levels of a deep
hierarchy. In general, pretraining may help both in terms of optimization
and in terms of generalization.

An example of greedy supervised pretraining is illustrated in figure 8.7,
in which each added hidden layer is pretrained as part of a shallow
supervised MLP, taking as input the output of the previously trained
hidden layer. Instead of pretraining one layer at a time, Simonyan and
Zisserman (2015) pretrain a deep convolutional network (eleven weight
layers) and then use the first four and last three layers from this
network to initialize even deeper networks (with up to nineteen layers
of weights). The middle layers of the new, very deep network are
initialized randomly. The new network is then jointly trained.
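
A rough numpy sketch of greedy layer-wise supervised pretraining (the layer
sizes, learning rate, and regression task are made up): each new hidden
layer is trained inside a shallow supervised MLP that takes the previous
layer's output as input, and the resulting weights would then initialize
joint fine-tuning.

import numpy as np

rng = np.random.default_rng(0)

def train_shallow(H, y, n_hidden, steps=500, lr=0.01):
    # Pretrain one hidden layer inside a shallow one-hidden-layer regressor.
    # H is the output of the previously trained layers (or the raw input);
    # the returned weight matrix W defines the new hidden layer.
    W = rng.normal(scale=0.1, size=(H.shape[1], n_hidden))
    v = rng.normal(scale=0.1, size=n_hidden)
    for _ in range(steps):
        Z = np.tanh(H @ W)                        # hidden activations
        err = Z @ v - y                           # squared-error residual
        grad_v = Z.T @ err / len(y)
        grad_W = H.T @ (np.outer(err, v) * (1.0 - Z ** 2)) / len(y)
        v -= lr * grad_v
        W -= lr * grad_W
    return W

# Made-up regression problem.
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Greedy phase: each layer is pretrained on the previous layer's output.
layers, H = [], X
for n_hidden in (8, 8):
    W = train_shallow(H, y, n_hidden)
    layers.append(W)
    H = np.tanh(H @ W)                            # input for the next layer

# A fine-tuning phase (omitted here) would now jointly train all of `layers`
# plus a final output layer, starting from this greedy initialization.
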
Regularization:
A central problem in machine learning is how to make an algorithm that
will perform well not just on the training data, but also on new inputs.
Many strategies used in machine learning are explicitly designed to
reduce the test error, possibly at the expense of increased training
error. These strategies are known collectively as regularization.

In the context of deep learning, most regularization strategies are
based on regularizing estimators. Regularization of an estimator works
by trading increased bias for reduced variance. An effective regularizer
is one that makes a profitable trade, reducing variance significantly
while not overly increasing the bias.
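
A concrete instance is L2-regularized (ridge) linear regression, mentioned
earlier as regularized least squares; the sketch below (with illustrative
data and λ values) shows how increasing the regularization coefficient
shrinks the weights, trading increased bias for reduced variance.

import numpy as np

def ridge_fit(X, y, lam):
    # Regularized least squares: minimize ||Xw - y||^2 + lam * ||w||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
w_true = np.zeros(10)
w_true[0] = 1.0
y = X @ w_true + 0.5 * rng.normal(size=30)

for lam in (0.0, 1.0, 10.0):
    w = ridge_fit(X, y, lam)
    print(lam, np.round(np.linalg.norm(w), 3))
# Larger lam shrinks the weight norm: higher bias, but lower variance
# across different training sets.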
