Unit 2

Training Neural Networks:


Risk Minimization:
The goal of a machine learning algorithm is to reduce the expected
generalization error

J*(θ) = E_{(x, y) ~ p_data} [ L(f(x; θ), y) ],

where L is the per-example loss function, f(x; θ) is the predicted output
when the input is x, and y is the target output. This quantity is known as
the risk. We emphasize here that the expectation is taken over the true
underlying data-generating distribution p_data. If we knew the true
distribution p_data(x, y), risk minimization would be an optimization task
solvable by an optimization algorithm. However, when we do not know
p_data(x, y) but only have a training set of samples, we have a machine
learning problem.

The simplest way to convert a machine learning problem back into an
optimization problem is to minimize the expected loss on the training
set. This means replacing the true distribution p(x, y) with the empirical
distribution p̂(x, y) defined by the training set. We now minimize the
empirical risk

E_{(x, y) ~ p̂_data} [ L(f(x; θ), y) ] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)),

where m is the number of training examples.


The training process based on minimizing this average training error is
known as empirical risk minimization. In this setting, machine learning
is still very similar to straightforward optimization. Rather than
optimizing the risk directly, we optimize the empirical risk, and hope
that the risk decreases significantly as well. A variety of theoretical
results establish conditions under which the true risk can be expected
to decrease by various amounts.
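
As a small illustration of empirical risk minimization, the sketch below
simply averages a per-example loss over the m training examples; the linear
model, squared-error loss, and synthetic data are illustrative choices, not
anything prescribed by the text.

import numpy as np

def empirical_risk(params, X, y, predict, loss):
    # Average per-example loss over the m training examples.
    # predict(params, x) and loss(y_hat, y) are placeholders for whatever
    # model and loss function are being used.
    m = len(X)
    return sum(loss(predict(params, X[i]), y[i]) for i in range(m)) / m

# Illustrative choices: a linear model with squared-error loss.
predict = lambda w, x: np.dot(w, x)
loss = lambda y_hat, y: 0.5 * (y_hat - y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
print(empirical_risk(w, X, y, predict, loss))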

Loss Function:

Surrogate Loss Functions and Early Stopping:

Sometimes, the loss function we actually care about (say classification
error) is not one that can be optimized efficiently. For example, exactly
minimizing expected 0-1 loss is typically intractable (exponential in the
input dimension), even for a linear classifier (Marcotte and Savard,
1992). In such situations, one typically optimizes a surrogate loss
function instead, which acts as a proxy but has advantages. For
example, the negative log-likelihood of the correct class is typically used
as a surrogate for the 0-1 loss. The negative log-likelihood allows the
model to estimate the conditional probability of the classes, given the
input, and if the model can do that well, then it can pick the classes that
yield the least classification error in expectation.
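
A small numerical illustration of the surrogate idea, with made-up scores
and labels: the 0-1 loss only counts mistakes, while the negative
log-likelihood of the correct class is a smooth quantity that gradients can
reduce.

import numpy as np

def zero_one_loss(logits, labels):
    # Fraction of examples whose highest-scoring class is wrong.
    return np.mean(np.argmax(logits, axis=1) != labels)

def nll_surrogate(logits, labels):
    # Negative log-likelihood of the correct class under a softmax model:
    # a smooth, differentiable proxy for the 0-1 loss.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

logits = np.array([[2.0, 0.1], [0.3, 1.5], [1.2, 1.1]])  # hypothetical scores
labels = np.array([0, 1, 1])

print(zero_one_loss(logits, labels))   # 1/3: the last example is misclassified
print(nll_surrogate(logits, labels))   # smooth value that training can reduce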

In some cases, a surrogate loss function actually results in being able to
learn more. For example, the test set 0-1 loss often continues to
decrease for a long time after the training set 0-1 loss has reached zero,
when training using the log-likelihood surrogate. This is because even
when the expected 0-1 loss is zero, one can improve the robustness of
the classifier by further pushing the classes apart from each other,
obtaining a more confident and reliable classifier, thus extracting more
information from the training data than would have been possible by
simply minimizing the average 0-1 loss on the training set.

A very
important difference between optimization in general and optimization
as we use it for training algorithms is that training algorithms do not
usually halt at a local minimum. Instead, a machine learning algorithm
usually minimizes a surrogate loss function but halts when a
convergence criterion based on early stopping (section 7.8) is satisfied.
Typically the early stopping criterion is based on the true underlying
loss function, such as 0-1 loss measured on a validation set, and is
designed to cause the algorithm to halt whenever overfitting begins to
occur. Training often halts while the surrogate loss function still has
large derivatives, which is very different from the pure optimization
setting, where an optimization algorithm is considered to have
converged when the gradient becomes very small.
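
The early-stopping behaviour described above can be sketched roughly as
follows; train_one_epoch and validation_error are hypothetical placeholders
for minimizing the surrogate loss and measuring the true loss of interest
(such as 0-1 loss on a validation set).

import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=5, max_epochs=200):
    # Halt when validation error has not improved for `patience` epochs.
    # train_one_epoch(model) minimizes the surrogate loss (e.g. NLL);
    # validation_error(model) measures the true underlying loss.
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        error = validation_error(model)
        if error < best_error:
            best_error, best_model = error, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # overfitting has likely begun; stop training
    return best_model, best_error
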
Model Selection
In a polynomial curve-fitting model, the order of the polynomial controls
the number of free parameters in
the model and thereby governs the model complexity. With regularized
least squares, the regularization coefficient λ also controls the effective
complexity of the model, whereas for more complex models, such as
mixture distributions or neural networks there may be multiple
parameters governing complexity. In a practical application, we need to
determine the values of such parameters, and the principal objective in
doing so is usually to achieve the best predictive performance on new
data.

Furthermore, as well as finding the appropriate values for complexity
parameters within a given model, we may wish to consider a range of
different types of model in order to find the best one for our particular
application.

We have already seen that, in the maximum likelihood approach, the
performance on the training set is not a good indicator of predictive
performance on unseen data due to the problem of over-fitting. If data
is plentiful, then one approach is simply to use some of the available
data to train a range of models, or a given model with a range of values
for its complexity parameters, and then to compare them on
independent data, sometimes called a validation set, and select the one
having the best predictive performance.

If the model design is iterated many times using a limited size data set,
then some over-fitting to the validation data can occur and so it may be
necessary to keep aside a third test set on which the performance of
the selected model is finally evaluated.
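
A minimal sketch of this protocol, assuming a hypothetical fit/error
interface: each candidate complexity setting is trained on the training
split, the setting with the lowest validation error is selected, and only
that final model is scored on the held-out test set.

import numpy as np

def select_model(candidates, fit, error, X, y, rng=np.random.default_rng(0)):
    # candidates: list of hyperparameter settings (e.g. polynomial orders).
    # fit(setting, X_train, y_train) -> model and error(model, X, y) -> float
    # are placeholders for whatever model family is being compared.
    n = len(X)
    idx = rng.permutation(n)
    train, valid, test = idx[: n // 2], idx[n // 2 : 3 * n // 4], idx[3 * n // 4 :]

    best = None
    for setting in candidates:
        model = fit(setting, X[train], y[train])
        val_err = error(model, X[valid], y[valid])
        if best is None or val_err < best[0]:
            best = (val_err, setting, model)

    val_err, setting, model = best
    test_err = error(model, X[test], y[test])  # evaluated once, at the very end
    return setting, val_err, test_err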

In many applications, however, the supply of data for training and
testing will be limited, and in order to build good models, we wish to
use as much of the available data as possible for training. However, if
the validation set is small, it will give a relatively noisy estimate of
predictive performance. One solution to this dilemma is to use cross-
validation, which is illustrated in Figure 1.18.

In S-fold cross-validation, the available data is partitioned into S groups,
and each group is held out in turn while the model is trained on the
remaining S − 1 groups. This allows a proportion (S − 1)/S of the available
data to be used for training while making use of all of the data to assess
performance.
When data is particularly scarce, it may be appropriate to consider the
case S = N, where N is the total number of data points, which gives the
leave-one-out technique.
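
A rough sketch of S-fold cross-validation under the same hypothetical
fit/error interface; choosing S = N reduces it to leave-one-out.

import numpy as np

def cross_validation_error(setting, fit, error, X, y, S):
    # Average held-out error over S folds; each fold trains on (S-1)/S of the data.
    N = len(X)
    folds = np.array_split(np.arange(N), S)   # S = N gives leave-one-out
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(N), held_out)
        model = fit(setting, X[train], y[train])
        errors.append(error(model, X[held_out], y[held_out]))
    return float(np.mean(errors))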

One major drawback of cross-validation is that the number of training
runs that must be performed is increased by a factor of S, and this can
prove problematic for models in which the training is itself
computationally expensive. A further problem with techniques such as
cross-validation that use separate data to assess performance is that
we might have multiple complexity parameters for a single model (for
instance, there might be several regularization parameters). Exploring
combinations of settings for such parameters could, in the worst case,
require a number of training runs that is exponential in the number of
parameters.

Clearly, we need a better approach. Ideally, this should rely only on the
training data and should allow multiple hyperparameters and model
types to be compared in a single training run. We therefore need to
find a measure of performance which depends only on the training data
and which does not suffer from bias due to over-fitting.

Optimization:
Machine learning algorithms usually require a high amount of numerical
computation. This typically refers to algorithms that solve mathematical
problems by methods that update estimates of the solution via an
iterative process, rather than analytically deriving a formula providing a
symbolic expression for the correct solution. Common operations
include optimization (finding the value of an argument that minimizes or
maximizes a function) and solving systems of linear equations.

Gradient-Based Optimization:
Most deep learning algorithms involve optimization of some
sort. Optimization refers to the task of either minimizing or maximizing
some function f(x) by altering x. We usually phrase most optimization
problems in terms of minimizing f(x). Maximization may be accomplished
via a minimization algorithm by minimizing −f(x). The function we want
to minimize or maximize is called the objective function or criterion.
When we are minimizing it, we may also call it the cost function, loss
function, or error function.
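
As a minimal illustration (the quadratic function, step size, and starting
point are arbitrary choices), gradient descent minimizes f by repeatedly
stepping against the gradient; maximizing f is the same as minimizing −f.

import numpy as np

def gradient_descent(grad_f, x0, learning_rate=0.1, steps=100):
    # Repeatedly step in the direction of the negative gradient.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - learning_rate * grad_f(x)
    return x

# f(x) = 0.5 * ||x||^2 has gradient x and its minimum at the origin.
x_min = gradient_descent(lambda x: x, x0=[3.0, -2.0])
print(x_min)  # close to [0, 0]
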
Difficulty of Training Deep Neural Networks
Challenges Motivating Deep Learning:
Simple machine learning algorithms have not succeeded in solving
the central problems in AI, such as recognizing speech or recognizing
objects. The development of deep learning was motivated in part by the
failure of traditional algorithms to generalize well on such AI tasks.

This section is about how the challenge of generalizing to new examples
becomes exponentially more difficult when working with high-dimensional
data, and how the mechanisms used to achieve generalization in traditional
machine learning are insufficient to learn complicated functions in
high-dimensional spaces. Such spaces also often impose high computational
costs. Deep learning was designed to overcome these and other obstacles.

The Curse of Dimensionality:

Many machine learning problems become exceedingly difficult when the
number of dimensions in the data is high. This phenomenon is known as
the curse of dimensionality. Of particular concern is that the number of
possible distinct configurations of a set of variables increases
exponentially as the number of variables increases.
The curse of dimensionality arises in many places in computer science,
and especially so in machine learning. One challenge posed by the curse
of dimensionality is a statistical challenge. As illustrated in figure 5.9, a
statistical challenge arises because the number of possible
configurations of x is much larger than the number of training
examples. To understand the issue, let us consider that the input space
is organized into a grid, like in the figure. We can describe low-
dimensional space with a low number of grid cells that are mostly
occupied by the data. When generalizing to a new data point, we can
usually tell what to do simply by inspecting the training examples that
lie in the same cell as the new input. For example, if estimating the
probability density at some point x, we can just return the number of
training examples in the same unit volume cell as x, divided by the total
number of training examples. If we wish to classify an example, we can
return the most common class of training examples in the same cell. If
we are doing regression we can average the target values observed
over the examples in that cell. But what about the cells for which we
have seen no example? Because in high-dimensional spaces the
number of configurations is huge, much larger than our number of
examples, a typical grid cell has no training example associated with it.
How could we possibly say something meaningful about these new
configurations? Many traditional machine learning algorithms simply
assume that the output at a new point should be approximately the
same as the output at the nearest training point.
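
A small numerical illustration of this statistical challenge (the sample
size and grid resolution are arbitrary): with a fixed number of examples,
the fraction of grid cells that contain any training data collapses as the
dimension grows.

import numpy as np

rng = np.random.default_rng(0)
m = 1000            # training examples
bins_per_dim = 10   # grid resolution along each axis

for d in (1, 2, 3, 5, 10):
    X = rng.uniform(size=(m, d))
    cells = set(map(tuple, np.floor(X * bins_per_dim).astype(int)))
    total = bins_per_dim ** d
    print(f"d={d:2d}: {len(cells)} of {total} cells occupied "
          f"({len(cells) / total:.2%})")
# Already at d=5 almost every cell is empty, so a nearest-cell rule has
# nothing to say about most new inputs.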

Local Constancy and Smoothness Regularization


In order to generalize well, machine learning algorithms need to be
guided by prior beliefs about what kind of function they should learn.
Previously, we have seen these priors incorporated as explicit beliefs in
the form of probability distributions over parameters of the model.
More informally, we may also discuss prior beliefs as directly
influencing the function itself and only indirectly acting on the
parameters via their effect on the function. Additionally, we informally
discuss prior beliefs as being expressed implicitly, by choosing
algorithms that are biased toward choosing some class of functions
over another, even though these biases may not be expressed (or even
possible to express) in terms of a probability distribution representing
our degree of belief in various functions. Among the most widely used
of these implicit “priors” is the smoothness prior or local constancy
prior. This prior states that the function we learn should not change
very much within a small region. Many simpler algorithms rely
exclusively on this prior to generalize well, and as a result they fail to
scale to the statistical challenges involved in solving AI level tasks.

While the k-nearest neighbors algorithm copies the output from nearby
training examples, most kernel machines interpolate between training
set outputs associated with nearby training examples.
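
A minimal 1-nearest-neighbour sketch (with made-up data) shows the local
constancy prior taken literally: a new point simply receives the output of
its closest training example.

import numpy as np

def one_nearest_neighbor(X_train, y_train, x):
    # Copy the output of the closest training example (local constancy).
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])
print(one_nearest_neighbor(X_train, y_train, np.array([0.2, 0.1])))  # -> 0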

Decision trees also suffer from the limitations of exclusively
smoothness-based learning because they break the input space into as
many regions as there are leaves and use a separate parameter (or
sometimes many parameters for extensions of decision trees) in each
region. If the target function requires a tree with at least n leaves to be
represented accurately, then at least n training examples are required
to fit the tree. A multiple of n is needed to achieve some level of
statistical confidence in the predicted output.

Other approaches to machine learning often make stronger, task-specific
assumptions. For example, we could easily solve a checkerboard-patterned
prediction task by providing the assumption that the target function
is periodic. Usually we do not include such strong, task-specific
assumptions into neural networks so that they can generalize to a much
wider variety of structures. AI tasks have structure that is much too
complex to be limited to simple, manually specified properties such as
periodicity, so we want learning algorithms that embody more general-
purpose assumptions. The core idea in deep learning is that we assume
that the data was generated by the composition of factors or features,
potentially at multiple levels in a hierarchy. Many other similarly
generic assumptions can further improve deep learning algorithms.
Manifold Learning
An important concept underlying many ideas in machine learning is
that of a manifold.

A manifold is a connected region. Mathematically, it is a set of points,
associated with a neighborhood around each point. From any given
point, the manifold locally appears to be a Euclidean space. In everyday
life, we experience the surface of the world as a 2-D plane, but it is in
fact a spherical manifold in 3-D space.

The definition of a neighborhood surrounding each point implies the
existence of transformations that can be applied to move on the
manifold from one position to a neighboring one. In the example of the
world’s surface as a manifold, one can walk north, south, east, or west.

Although there is a formal mathematical meaning to the term
“manifold,” in machine learning it tends to be used more loosely to
designate a connected set of points that can be approximated well by
considering only a small number of degrees of freedom, or dimensions,
embedded in a higher-dimensional space. Each dimension corresponds
to a local direction of variation.

In the context of machine learning, we allow the dimensionality of the
manifold to vary from one point to another. This often happens when a
manifold intersects itself.

The assumption that the data lies along a low-dimensional manifold
may not always be correct or useful. We argue that in the context of AI
tasks, such as those that involve processing images, sounds, or text, the
manifold assumption is at least approximately correct. The evidence in
favor of this assumption consists of two categories of observations.
The first observation in favor of the manifold hypothesis is that the
probability distribution over images, text strings, and sounds that occur
in real life is highly concentrated.

The second argument in favor of the manifold hypothesis is that we can
also imagine such neighborhoods and transformations, at least
informally. In the case of images, we can certainly think of many
possible transformations that allow us to trace out a manifold in image
space: we can gradually dim or brighten the lights, gradually move or
rotate objects in the image, gradually alter the colors on the surfaces of
objects, etc.
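
A tiny numerical illustration of the idea (the circle is an arbitrary
choice): points of the form (cos t, sin t) live in 2-D space but occupy only
a 1-D manifold, and the single degree of freedom t is the transformation
that moves along it.

import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 200)              # one degree of freedom
points = np.stack([np.cos(t), np.sin(t)], axis=1)   # embedded in 2-D space

# Every point is 2-dimensional, yet the data occupy a 1-D manifold:
# small changes in t move you to neighbouring points on the circle.
print(points.shape)                                       # (200, 2)
print(np.allclose(np.linalg.norm(points, axis=1), 1.0))   # all on the unit circle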

Greedy Layer-Wise Training:

Sometimes, directly training a model to solve a specific task can be too
ambitious if the model is complex and hard to optimize or if the task is
very difficult. It is sometimes more effective to train a simpler model to
solve the task, then make the model more complex. It can also be more
effective to train the model to solve a simpler task, then move on to
confront the final task. These strategies that involve training simple
models on simple tasks before confronting the challenge of training the
desired model to perform the desired task are collectively known as
pretraining.

Greedy algorithms break a problem into many components, then solve
for the optimal version of each component in isolation. Unfortunately,
combining the individually optimal components is not guaranteed to
yield an optimal complete solution. However, greedy algorithms can be
computationally much cheaper than algorithms that solve for the best
joint solution, and the quality of a greedy solution is often acceptable if
not optimal. Greedy algorithms may also be followed by a fine-tuning
stage in which a joint optimization algorithm searches for an optimal
solution to the full problem. Initializing the joint optimization algorithm
with a greedy solution can greatly speed it up and improve the quality
of the solution it finds.

Pretraining, and especially greedy pretraining, algorithms are ubiquitous
in deep learning. Of particular interest here are pretraining algorithms
that break supervised learning problems into other, simpler supervised
learning problems. This approach is known as greedy supervised pretraining.

Why would greedy supervised pretraining help? One hypothesis is that it
helps to provide better guidance to the intermediate levels of a deep
hierarchy. In general, pretraining may help both in terms of optimization
and in terms of generalization.

An example of greedy supervised pretraining is illustrated in figure 8.7,
in which each added hidden layer is pretrained as part of a shallow
supervised MLP, taking as input the output of the previously trained
hidden layer. Instead of pretraining one layer at a time, Simonyan and
Zisserman (2015) pretrain a deep convolutional network (eleven weight
layers) and then use the first four and last three layers from this
network to initialize even deeper networks (with up to nineteen layers
of weights). The middle layers of the new, very deep network are
initialized randomly. The new network is then jointly trained.
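
A rough numpy sketch of greedy layer-wise supervised pretraining (the layer
sizes, learning rate, and regression task are made up): each new hidden
layer is trained inside a shallow supervised MLP that takes the previous
layer's output as input, and the resulting weights would then initialize
joint fine-tuning.

import numpy as np

rng = np.random.default_rng(0)

def train_shallow(H, y, n_hidden, steps=500, lr=0.01):
    # Pretrain one hidden layer inside a shallow one-hidden-layer regressor.
    # H is the output of the previously trained layers (or the raw input);
    # the returned weight matrix W defines the new hidden layer.
    W = rng.normal(scale=0.1, size=(H.shape[1], n_hidden))
    v = rng.normal(scale=0.1, size=n_hidden)
    for _ in range(steps):
        Z = np.tanh(H @ W)                        # hidden activations
        err = Z @ v - y                           # squared-error residual
        grad_v = Z.T @ err / len(y)
        grad_W = H.T @ (np.outer(err, v) * (1.0 - Z ** 2)) / len(y)
        v -= lr * grad_v
        W -= lr * grad_W
    return W

# Made-up regression problem.
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Greedy phase: each layer is pretrained on the previous layer's output.
layers, H = [], X
for n_hidden in (8, 8):
    W = train_shallow(H, y, n_hidden)
    layers.append(W)
    H = np.tanh(H @ W)                            # input for the next layer

# A fine-tuning phase (omitted here) would now jointly train all of `layers`
# plus a final output layer, starting from this greedy initialization.
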
Regularization:
A central problem in machine learning is how to make an algorithm that
will perform well not just on the training data, but also on new inputs.
Many strategies used in machine learning are explicitly designed to
reduce the test error, possibly at the expense of increased training
error. These strategies are known collectively as regularization.

In the context of deep learning, most regularization strategies are
based on regularizing estimators. Regularization of an estimator works
by trading increased bias for reduced variance. An effective regularizer
is one that makes a profitable trade, reducing variance significantly
while not overly increasing the bias.
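
A concrete instance is L2-regularized (ridge) linear regression, mentioned
earlier as regularized least squares; the sketch below (with illustrative
data and λ values) shows how increasing the regularization coefficient
shrinks the weights, trading increased bias for reduced variance.

import numpy as np

def ridge_fit(X, y, lam):
    # Regularized least squares: minimize ||Xw - y||^2 + lam * ||w||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
w_true = np.zeros(10)
w_true[0] = 1.0
y = X @ w_true + 0.5 * rng.normal(size=30)

for lam in (0.0, 1.0, 10.0):
    w = ridge_fit(X, y, lam)
    print(lam, np.round(np.linalg.norm(w), 3))
# Larger lam shrinks the weight norm: higher bias, but lower variance
# across different training sets.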
