4. Machine Learning Basics (C)
The central challenge in machine learning is that we must perform well on new,
previously unseen inputs—not just those on which our model was trained. The
ability to perform well on previously unobserved inputs is called generalization.
When training a machine learning model, we have access to a training set, we can
compute some error measure on the training set called the training error, and we
reduce this training error. So far, what we have described is simply an
optimization problem. What separates machine learning from optimization is that
we want the generalization error, also called the test error, to be low as well.
The generalization error is defined as the expected value of the error on a new
input. Here the expectation is taken across different possible inputs, drawn from
the distribution of inputs we expect the system to encounter in practice.
In our linear regression example, we trained the model by minimizing the training
error,
MSE_train = (1/m^(train)) ‖X^(train) w − y^(train)‖²,
but what we actually care about is the test error,
MSE_test = (1/m^(test)) ‖X^(test) w − y^(test)‖².
How can
we affect performance on the test set when we get to observe only the training set?
The field of statistical learning theory provides some answers. If the training and the test
set are collected arbitrarily, there is indeed little we can do. If we are allowed to make
some assumptions about how the training and test set are collected, then we can make
some progress.
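As a concrete sketch of this setup (all data and names here are synthetic and illustrative, not from the text): we draw a training set and a test set independently from the same data generating distribution, fit linear regression in closed form on the training set only, and then measure both errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data generating distribution: y = 2x + 1 plus Gaussian noise.
# Train and test sets are drawn i.i.d. from this same distribution.
def sample_dataset(m):
    x = rng.uniform(-1, 1, size=m)
    y = 2 * x + 1 + 0.1 * rng.normal(size=m)
    X = np.column_stack([np.ones(m), x])  # bias column + feature
    return X, y

X_train, y_train = sample_dataset(100)
X_test, y_test = sample_dataset(100)

# Fit by minimizing MSE_train: the normal equations, solved via the
# Moore-Penrose pseudoinverse.
w = np.linalg.pinv(X_train) @ y_train

mse_train = np.mean((X_train @ w - y_train) ** 2)
mse_test = np.mean((X_test @ w - y_test) ** 2)
```

Because both sets come from the same distribution, the test error here stays close to the training error; the interesting cases in the rest of this section are the ones where it does not.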
The train and test data are generated by a probability distribution over datasets called
the data generating process. We typically make a set of assumptions known
collectively as the i.i.d. assumptions.
These assumptions are that the examples in each dataset are independent from each
other, and that the train set and test set are identically distributed, drawn from the
same probability distribution as each other. This assumption allows us to describe the
data generating process with a probability distribution over a single example. The
same distribution is then used to generate every train example and every test example.
We call that shared underlying distribution the data generating distribution, denoted
pdata. This probabilistic framework and the i.i.d. assumptions allow us to
mathematically study the relationship between training error and test error.
When we use a machine learning algorithm, we do not fix the parameters ahead of
time, then sample both datasets. We sample the training set, then use it to choose
the parameters to reduce training set error, then sample the test set. Under this
process, the expected test error is greater than or equal to the expected value of
training error.
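This sampling process can be simulated directly (a hedged sketch with made-up synthetic data): sample a training set, choose parameters to reduce training error, then sample a fresh test set, and repeat many times. Averaged over repetitions, the test error comes out at least as large as the training error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data generating distribution: y = x plus Gaussian noise.
def sample_dataset(m):
    x = rng.uniform(-1, 1, size=m)
    y = x + 0.5 * rng.normal(size=m)
    X = np.column_stack([np.ones(m), x])
    return X, y

train_errs, test_errs = [], []
for _ in range(500):
    # Sample the training set, choose parameters to reduce training
    # error, then sample the test set.
    X_tr, y_tr = sample_dataset(10)
    w = np.linalg.pinv(X_tr) @ y_tr
    X_te, y_te = sample_dataset(10)
    train_errs.append(np.mean((X_tr @ w - y_tr) ** 2))
    test_errs.append(np.mean((X_te @ w - y_te) ** 2))

avg_train = np.mean(train_errs)
avg_test = np.mean(test_errs)
```

The gap appears because the parameters were tuned to the particular noise in each training sample, an advantage the test sample does not share.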
The factors determining how well a machine learning algorithm will perform are its
ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
Underfitting occurs when the model is not able to obtain a sufficiently low error
value on the training set.
Overfitting occurs when the gap between the training error and test error is
too large.
We can control whether a model is more likely to overfit or underfit by altering its
capacity.
Models with low capacity may struggle to fit the training set.
Models with high capacity can overfit by memorizing properties of the training set
that do not serve them well on the test set.
One way to control the capacity of a learning algorithm is by choosing its hypothesis
space, the set of functions that the learning algorithm is allowed to select as being the
solution.
For example:
The linear regression algorithm has the set of all linear functions of its input as its
hypothesis space.
A polynomial of degree 1 gives us the linear regression model with which we are
already familiar, with prediction
yˆ = b + wx.
By introducing x² as another feature provided to the linear regression model, we can
learn a model that is quadratic as a function of x:
yˆ = b + w₁x + w₂x².
Though this model implements a quadratic function of its input, the output is still a
linear function of the parameters, so we can still use the normal equations to train
the model in closed form. We can continue to add more powers of x as additional
features, for example to obtain a polynomial of degree 9:
yˆ = b + Σᵢ₌₁⁹ wᵢxⁱ.
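The feature-expansion trick above can be sketched as follows (synthetic noiseless data, illustrative constants): build columns of powers of x, then solve the ordinary least-squares problem exactly as in plain linear regression, since the model is still linear in its parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

def poly_features(x, degree):
    # Columns [1, x, x^2, ..., x^degree]: powers of x as extra features.
    return np.vander(x, degree + 1, increasing=True)

x = rng.uniform(-1, 1, size=50)
y = 1.0 - 2.0 * x + 3.0 * x ** 2   # noiseless quadratic target

# The output is linear in the parameters (b, w1, w2), so the normal
# equations still give the fit in closed form.
X = poly_features(x, degree=2)
w = np.linalg.pinv(X) @ y
```

With noiseless quadratic data, the recovered parameters match the generating coefficients (1, −2, 3) essentially exactly.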
Machine learning algorithms will generally perform best when their capacity is
appropriate for the true complexity of the task they need to perform and the amount
of training data they are provided with.
Models with insufficient capacity are unable to solve complex tasks. Models with
high capacity can solve complex tasks, but when their capacity is higher than
needed to solve the present task they may overfit.
Figure: We fit three models to this example training set. The training data was
generated synthetically by randomly sampling x values and choosing y
deterministically by evaluating a quadratic function.
(Left) A linear function fit to the data suffers from underfitting: it cannot
capture the curvature that is present in the data.
(Center) A quadratic function fit to the data generalizes well to unseen
points. It does not suffer from a significant amount of overfitting or
underfitting.
(Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here
we used the Moore-Penrose pseudoinverse to solve the underdetermined
normal equations.
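The three panels can be reproduced qualitatively in a few lines (a sketch with illustrative synthetic data, not the figure's actual numbers): sample a small noisy quadratic dataset, fit polynomials of degree 1, 2, and 9 via the pseudoinverse, and compare training and test error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic version of the figure's setup: sample x, evaluate a quadratic
# function of x, and add noise. All constants are illustrative.
def make_data(m):
    x = rng.uniform(-1, 1, size=m)
    y = x ** 2 - 0.5 * x + 0.3 * rng.normal(size=m)
    return x, y

x_tr, y_tr = make_data(10)
x_te, y_te = make_data(200)

results = {}
for degree in (1, 2, 9):
    X_tr = np.vander(x_tr, degree + 1, increasing=True)
    # The Moore-Penrose pseudoinverse handles the degree-9 case, where
    # the normal equations are badly conditioned / underdetermined.
    w = np.linalg.pinv(X_tr) @ y_tr
    X_te = np.vander(x_te, degree + 1, increasing=True)
    results[degree] = (np.mean((X_tr @ w - y_tr) ** 2),   # train MSE
                      np.mean((X_te @ w - y_te) ** 2))    # test MSE
```

With 10 training points, the degree-9 model has as many parameters as examples: it drives training error to essentially zero while its test error exceeds the quadratic model's, the signature of overfitting.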
Figure: Typical relationship between capacity and error.
Training and test error behave differently. At the left end of the graph,
training error and generalization error are both high. This is the
underfitting regime.
As we increase capacity, training error decreases, but the gap between
training and generalization error increases.
Eventually, the size of this gap outweighs the decrease in training error, and
we enter the overfitting regime, where capacity is too large, above the
optimal capacity.
To logically infer a rule describing every member of a set one must have
information about every member of that set.
Machine learning promises to find rules that are probably correct about
most members of the set they concern.
The no free lunch theorem for machine learning states that, averaged over all
possible data generating distributions, every classification algorithm has the same
error rate when classifying previously unobserved points.
Figure : The effect of the training dataset size on the train and test error, as well as
on the optimal model capacity.
For each size, we generated 40 different training sets in order to plot error
bars showing 95 percent confidence intervals.
(Top) The MSE on the training and test set for two different models:
o The training error increases as the size of the training set increases. This is
because larger datasets are harder to fit.
o Simultaneously the test error decreases because fewer incorrect hypotheses
are consistent with the training data.
A model with degree chosen to minimize the test error:
o The quadratic model does not have enough capacity to solve the task, so its
test error asymptotes to a high value.
o The test error at optimal capacity asymptotes to the Bayes error.
o The training error can fall below the Bayes error due to the ability of the
training algorithm to memorize specific instances of the training set.
o As the training size increases to infinity, the training error of any fixed-
capacity model (here, the quadratic model) must rise to at least the Bayes
error.
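These trends can be checked with a small simulation (a hedged sketch; the data generating process, noise level, and trial counts are all made up): fit a quadratic model to training sets of different sizes and average the errors over repeated draws.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative quadratic data generating process with Gaussian noise.
def make_data(m):
    x = rng.uniform(-1, 1, size=m)
    y = x ** 2 + 0.3 * rng.normal(size=m)
    return x, y

def fit_quadratic_mse(m_train, trials=40):
    # Average train/test MSE of a quadratic fit over repeated draws
    # of a size-m_train training set.
    tr, te = [], []
    x_test, y_test = make_data(2000)
    X_te = np.vander(x_test, 3, increasing=True)
    for _ in range(trials):
        x_tr, y_tr = make_data(m_train)
        X_tr = np.vander(x_tr, 3, increasing=True)
        w = np.linalg.pinv(X_tr) @ y_tr
        tr.append(np.mean((X_tr @ w - y_tr) ** 2))
        te.append(np.mean((X_te @ w - y_test) ** 2))
    return np.mean(tr), np.mean(te)

small = fit_quadratic_mse(5)     # tiny training set
large = fit_quadratic_mse(500)   # large training set
```

The averages reproduce both effects described above: training error rises with training set size (larger datasets are harder to fit exactly), while test error falls toward the noise floor.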
Regularization
The no free lunch theorem implies that we must design our machine
learning algorithms to perform well on a specific task.
The behavior of our algorithm is strongly affected not just by how large
we make the set of functions allowed in its hypothesis space, but by the
specific identity of those functions.
Linear functions can be very useful for problems where the relationship
between inputs and outputs truly is close to linear, but they are less useful
for problems that behave in a very nonlinear fashion.
For example,
o we can modify the training criterion for linear regression to include weight
decay, minimizing J(w) = MSE_train + λwᵀw,
o where λ is a value chosen ahead of time that controls the strength of our
preference for smaller weights.
o This gives us solutions that have a smaller slope, or that put weight on fewer of
the features.
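A minimal sketch of weight decay in code (λ, the data, and the helper name are all illustrative): minimizing J(w) = MSE_train + λwᵀw in the linear case has the closed-form solution w = (XᵀX + mλI)⁻¹ Xᵀy, obtained by setting the gradient of J to zero.

```python
import numpy as np

rng = np.random.default_rng(5)

# Noisy quadratic data, deliberately few points so that an unregularized
# degree-9 fit memorizes the noise.
x = rng.uniform(-1, 1, size=10)
y = x ** 2 + 0.3 * rng.normal(size=10)
X = np.vander(x, 10, increasing=True)  # degree-9 polynomial features
m = len(y)

def ridge_fit(X, y, lam):
    # Minimizes J(w) = MSE_train + lam * w^T w; setting the gradient to
    # zero gives w = (X^T X + m * lam * I)^(-1) X^T y.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

w_no_decay = np.linalg.pinv(X) @ y   # lam = 0: fits the noise exactly
w_decay = ridge_fit(X, y, lam=0.1)   # prefers smaller weights
```

The regularized solution has a smaller parameter norm than the unregularized one, which is exactly the preference the λwᵀw term expresses.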