
4. MACHINE LEARNING BASICS: CAPACITY, OVERFITTING AND UNDERFITTING

 The central challenge in machine learning is that we must perform well on new,
previously unseen inputs—not just those on which our model was trained. The
ability to perform well on previously unobserved inputs is called generalization.

 When training a machine learning model, we have access to a training set, we can
compute some error measure on the training set called the training error, and we
reduce this training error. So far, what we have described is simply an
optimization problem. What separates machine learning from optimization is that
we want the generalization error, also called the test error, to be low as well.

 The generalization error is defined as the expected value of the error on a new
input. Here the expectation is taken across different possible inputs, drawn from
the distribution of inputs we expect the system to encounter in practice.

 We typically estimate the generalization error of a machine learning model by
measuring its performance on a test set of examples that were collected separately
from the training set.
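
To make this concrete, here is a minimal sketch (not from the original text; the data-generating function, noise level, and dataset sizes are assumptions chosen for illustration) that fits linear regression in closed form on a training set and then estimates the generalization error on a separately sampled test set:

    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n):
        # Inputs drawn from an assumed data-generating distribution,
        # targets from an assumed linear function plus Gaussian noise.
        x = rng.uniform(-1, 1, size=n)
        y = 0.5 + 2.0 * x + rng.normal(scale=0.1, size=n)
        return x, y

    x_train, y_train = make_data(20)
    x_test, y_test = make_data(1000)     # collected separately from the training set

    # Design matrices with a bias column, so the model is y_hat = b + w*x.
    X_train = np.column_stack([np.ones_like(x_train), x_train])
    X_test = np.column_stack([np.ones_like(x_test), x_test])

    # Minimize the training error (MSE on the training set) in closed form.
    w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

    mse_train = np.mean((X_train @ w - y_train) ** 2)  # training error
    mse_test = np.mean((X_test @ w - y_test) ** 2)     # estimate of the generalization error
    print(mse_train, mse_test)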

 In our linear regression example, we trained the model by minimizing the mean
squared error on the training set, MSE_train. But we actually care about the test error,
MSE_test, measured on new examples. How can we affect performance on the test set
when we get to observe only the training set?

 The field of statistical learning theory provides some answers. If the training and the test
set are collected arbitrarily, there is indeed little we can do. If we are allowed to make
some assumptions about how the training and test set are collected, then we can make
some progress.

 The train and test data are generated by a probability distribution over datasets called
the data generating process. We typically make a set of assumptions known
collectively as the i.i.d. assumptions.

 These assumptions are that the examples in each dataset are independent from each
other, and that the train set and test set are identically distributed, drawn from the
same probability distribution as each other. This assumption allows us to describe the
data generating process with a probability distribution over a single example. The
same distribution is then used to generate every train example and every test example.
 We call that shared underlying distribution the data generating distribution, denoted
p_data. This probabilistic framework and the i.i.d. assumptions allow us to
mathematically study the relationship between training error and test error.

 When we use a machine learning algorithm, we do not fix the parameters ahead of
time, then sample both datasets. We sample the training set, then use it to choose
the parameters to reduce training set error, then sample the test set. Under this
process, the expected test error is greater than or equal to the expected value of
training error.
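
This claim can be checked empirically. The toy simulation below (an illustration under assumed data, not part of the original text) repeatedly samples a training set and a test set from the same distribution, fits linear regression on the training set only, and averages both errors over many draws; the average test error comes out at least as large as the average training error:

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_dataset(n):
        # Train and test examples are drawn i.i.d. from the same (assumed) distribution.
        x = rng.uniform(-1, 1, size=n)
        y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
        return np.column_stack([np.ones_like(x), x]), y

    train_errs, test_errs = [], []
    for _ in range(1000):
        X_tr, y_tr = sample_dataset(10)
        X_te, y_te = sample_dataset(10)
        w = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]  # parameters chosen from the training set
        train_errs.append(np.mean((X_tr @ w - y_tr) ** 2))
        test_errs.append(np.mean((X_te @ w - y_te) ** 2))

    # Expected test error >= expected training error under this sampling process.
    print(np.mean(train_errs), np.mean(test_errs))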

The factors determining how well a machine learning algorithm will perform are its
ability to:

 Make the training error small.


 Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning:
underfitting and overfitting.

 Underfitting occurs when the model is not able to obtain a sufficiently low error
value on the training set.

 Overfitting occurs when the gap between the training error and test error is
too large.

We can control whether a model is more likely to overfit or underfit by altering its
capacity.

 Informally, a model’s capacity is its ability to fit a wide variety of functions.

 Models with low capacity may struggle to fit the training set.

 Models with high capacity can overfit by memorizing properties of the training set
that do not serve them well on the test set.

One way to control the capacity of a learning algorithm is by choosing its hypothesis
space, the set of functions that the learning algorithm is allowed to select as being the
solution.
For example:

 The linear regression algorithm has the set of all linear functions of its input as its
hypothesis space.

 A polynomial of degree one gives us the linear regression model with which we are
already familiar, with prediction ŷ = b + wx.

 By introducing x² as another feature provided to the linear regression model, we
can learn a model that is quadratic as a function of x:

ŷ = b + w₁x + w₂x².
 Though this model implements a quadratic function of its input, the output is still a
linear function of the parameters, so we can still use the normal equations to train
the model in closed form. We can continue to add more powers of x as additional
features, for example to obtain a polynomial of degree 9:

ŷ = b + w₁x + w₂x² + … + w₉x⁹.
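
The sketch below (an illustrative example with an assumed quadratic data-generating function) shows how adding powers of x as extra features enlarges the hypothesis space while the model stays linear in its parameters, so the normal equations, here via the Moore-Penrose pseudoinverse, still give a closed-form fit:

    import numpy as np

    rng = np.random.default_rng(2)

    # Assumed synthetic data: a quadratic ground truth plus a little noise.
    x = rng.uniform(-1, 1, size=12)
    y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.05, size=12)

    def poly_features(x, degree):
        # Columns [1, x, x^2, ..., x^degree]; the model is still linear in w.
        return np.vander(x, degree + 1, increasing=True)

    def fit(x, y, degree):
        # Closed-form least-squares fit; the pseudoinverse also handles
        # the underdetermined degree-9 case.
        return np.linalg.pinv(poly_features(x, degree)) @ y

    for degree in (1, 2, 9):
        w = fit(x, y, degree)
        train_mse = np.mean((poly_features(x, degree) @ w - y) ** 2)
        print(degree, train_mse)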

 Machine learning algorithms will generally perform best when their capacity is
appropriate for the true complexity of the task they need to perform and the amount
of training data they are provided with.

 Models with insufficient capacity are unable to solve complex tasks. Models with
high capacity can solve complex tasks, but when their capacity is higher than
needed to solve the present task they may overfit.
 We fit three models to an example training set.
 The training data was generated synthetically by randomly sampling x values
and choosing y deterministically by evaluating a quadratic function.
 (Left) A linear function fit to the data suffers from underfitting: it cannot
capture the curvature that is present in the data.
 (Center) A quadratic function fit to the data generalizes well to unseen
points. It does not suffer from a significant amount of overfitting or
underfitting.
 (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here
we used the Moore-Penrose pseudoinverse to solve the underdetermined
normal equations.
Figure: Typical relationship between capacity and error.

 Training and test error behave differently. At the left end of the graph,
training error and generalization error are both high. This is the
underfitting regime.
 As we increase capacity, training error decreases, but the gap between
training and generalization error increases.
 Eventually, the size of this gap outweighs the decrease in training error, and
we enter the overfitting regime, where capacity is too large, above the
optimal capacity.
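
A capacity sweep of this kind is easy to reproduce. The sketch below (illustrative, with an assumed quadratic data-generating process) fits polynomials of increasing degree in closed form and reports training and test MSE, so the underfitting and overfitting regimes appear as the degree, i.e. the capacity, grows:

    import numpy as np

    rng = np.random.default_rng(3)

    def make_data(n):
        # Assumed quadratic data-generating process with noise.
        x = rng.uniform(-1, 1, size=n)
        return x, 1.0 + x - 2.0 * x**2 + rng.normal(scale=0.1, size=n)

    x_tr, y_tr = make_data(15)
    x_te, y_te = make_data(500)

    for degree in range(1, 10):
        X_tr = np.vander(x_tr, degree + 1, increasing=True)
        X_te = np.vander(x_te, degree + 1, increasing=True)
        w = np.linalg.pinv(X_tr) @ y_tr              # closed-form fit
        err_tr = np.mean((X_tr @ w - y_tr) ** 2)     # decreases as capacity grows
        err_te = np.mean((X_te @ w - y_te) ** 2)     # high, then low, then high again
        print(degree, err_tr, err_te)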

The No Free Lunch Theorem

 Learning theory claims that a machine learning algorithm can generalize
well from a finite training set of examples. This seems to contradict some
basic principles of logic.

 Inductive reasoning, or inferring general rules from a limited set of
examples, is not logically valid.

 To logically infer a rule describing every member of a set one must have
information about every member of that set.

 In part, machine learning avoids this problem by offering only probabilistic
rules, rather than the entirely certain rules used in purely logical reasoning.

 Machine learning promises to find rules that are probably correct about
most members of the set they concern.
The no free lunch theorem for machine learning states that, averaged over all
possible data-generating distributions, every classification algorithm has the same
error rate when classifying previously unobserved points.
Figure: The effect of the training dataset size on the train and test error, as well as
on the optimal model capacity.

 We constructed a synthetic regression problem based on adding a moderate
amount of noise to a degree-5 polynomial, generated a single test set, and
then generated several different sizes of training set.

 For each size, we generated 40 different training sets in order to plot error
bars showing 95 percent confidence intervals.

(Top) The MSE on the training and test set for two different models:

 Quadratic model, and


 Model with degree chosen to minimize the test error.
Both are fit in closed form.

For the quadratic model:

o The training error increases as the size of the training set increases. This is
because larger datasets are harder to fit.
o Simultaneously the test error decreases because fewer incorrect hypotheses
are consistent with the training data.
Model with degree chosen to minimize the test error:

o The quadratic model does not have enough capacity to solve the task, so its
test error asymptotes to a high value.
o The test error at optimal capacity asymptotes to the Bayes error.
o The training error can fall below the Bayes error due to the ability of the
training algorithm to memorize specific instances of the training set.
o As the training size increases to infinity, the training error of any fixed-
capacity model (here, the quadratic model) must rise to at least the Bayes
error.
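
The behavior described above can be simulated directly. In the sketch below (illustrative; the degree-5 ground truth, noise level, training-set sizes, and 40 repetitions are assumptions mirroring the figure's description) a fixed-capacity quadratic model is fit to training sets of increasing size, and the training and test MSE are averaged over the repetitions:

    import numpy as np

    rng = np.random.default_rng(4)
    true_coefs = rng.normal(size=6)      # assumed degree-5 ground-truth polynomial

    def make_data(n):
        x = rng.uniform(-1, 1, size=n)
        # np.polyval expects coefficients in decreasing-degree order.
        y = np.polyval(true_coefs, x) + rng.normal(scale=0.1, size=n)
        return x, y

    x_te, y_te = make_data(2000)                     # a single, fixed test set
    X_te = np.vander(x_te, 3, increasing=True)

    for n in (5, 20, 100, 1000, 10000):
        tr, te = [], []
        for _ in range(40):                          # 40 training sets per size
            x_tr, y_tr = make_data(n)
            X_tr = np.vander(x_tr, 3, increasing=True)   # fixed-capacity quadratic model
            w = np.linalg.pinv(X_tr) @ y_tr
            tr.append(np.mean((X_tr @ w - y_tr) ** 2))
            te.append(np.mean((X_te @ w - y_te) ** 2))
        print(n, np.mean(tr), np.mean(te))           # training error rises, test error falls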

Regularization

 The no free lunch theorem implies that we must design our machine
learning algorithms to perform well on a specific task.

 We do so by building a set of preferences into the learning algorithm.
When these preferences are aligned with the learning problems we ask the
algorithm to solve, it performs better.
 So far, the only method of modifying a learning algorithm that we have
discussed concretely is to increase or decrease the model’s representational
capacity by adding or removing functions from the hypothesis space of
solutions the learning algorithm is able to choose.

 We gave the specific example of increasing or decreasing the degree of a
polynomial for a regression problem. The view we have described so far is
oversimplified.

 The behavior of our algorithm is strongly affected not just by how large
we make the set of functions allowed in its hypothesis space, but by the
specific identity of those functions.

 The learning algorithm we have studied so far, linear regression, has a
hypothesis space consisting of the set of linear functions of its input.

 These linear functions can be very useful for problems where the
relationship between inputs and outputs truly is close to linear.

 They are less useful for problems that behave in a very nonlinear
fashion.

For example,
o we can modify the training criterion for linear regression to include weight
decay.

o To perform linear regression with weight decay, we minimize a sum comprising
both the mean squared error on the training set and a criterion J(w) that expresses
a preference for the weights to have smaller squared L2 norm. Specifically,

J(w) = MSE_train + λwᵀw,

o where λ is a value chosen ahead of time that controls the strength of our
preference for smaller weights.

o When λ = 0, we impose no preference, and larger λ forces the weights to


become smaller.
o Minimizing J(w) results in a choice of weights that makes a tradeoff
between fitting the training data and being small.

o This gives us solutions that have a smaller slope, or put weight on fewer of
the features.
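
As a minimal sketch of weight decay (assuming a degree-9 polynomial feature map and synthetic data similar to the earlier example; none of these choices come from the text), the code below minimizes J(w) = MSE_train + λwᵀw in closed form and shows that larger λ yields weights with a smaller norm:

    import numpy as np

    rng = np.random.default_rng(5)

    # Assumed synthetic data and a degree-9 polynomial feature map.
    x = rng.uniform(-1, 1, size=10)
    y = np.sin(2 * x) + rng.normal(scale=0.05, size=10)
    X = np.vander(x, 10, increasing=True)            # columns 1, x, ..., x^9

    def fit_weight_decay(X, y, lam):
        # Closed-form minimizer of J(w) = (1/m)||Xw - y||^2 + lam * w^T w,
        # i.e. w = (X^T X + m*lam*I)^{-1} X^T y.
        m, k = X.shape
        return np.linalg.solve(X.T @ X + m * lam * np.eye(k), X.T @ y)

    for lam in (1e2, 1e-2, 1e-8):                    # large, medium, near-zero weight decay
        w = fit_weight_decay(X, y, lam)
        mse = np.mean((X @ w - y) ** 2)
        print(lam, mse, np.linalg.norm(w))           # larger lam -> smaller weight norm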

 We vary the amount of weight decay to prevent these high-degree models
from overfitting.

 (Left) With very large λ, we can force the model to learn a function
with no slope at all. This underfits because it can only represent a
constant function.

 (Center) With a medium value of λ, the learning algorithm recovers a
curve with the right general shape. Even though the model is capable of
representing functions with much more complicated shape, weight decay
has encouraged it to use a simpler function described by smaller
coefficients.

 (Right) With weight decay approaching zero (i.e., using the Moore-Penrose
pseudoinverse to solve the underdetermined problem with minimal
regularization), the high-degree model overfits significantly.

 Regularization is any modification we make to a learning algorithm that is
intended to reduce its generalization error but not its training error.

 Regularization is one of the central concerns of the field of machine
learning, rivaled in its importance only by optimization.
