Lecture 02-03
Arpit Rana
3rd / 6th January 2025
Supervised Learning
Problem Settings and Examples
Supervised Learning: A Formal Model
● Label set
A set of possible labels, Y, e.g., {0, 1} or {-1, 1}.
● Training data
S = ((x1, y1), . . . , (xm, ym)) is a finite sequence of pairs in X × Y, i.e., a sequence of labeled domain points.
Data-generation Model:
● Let D be a probability distribution over X × Y, i.e., D is a joint probability distribution over domain points and labels. D can be viewed as composed of two parts:
○ a distribution Dx over unlabeled domain points (sometimes called the marginal distribution), and
○ a conditional probability over labels for each domain point, D((x, y) | x).
[Figure: the training phase produces a final hypothesis or model (h); in the test phase, h maps a test instance to a prediction. Stationarity: the test instance follows the same distribution as the training instances.]
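To make the data-generation model and the stationarity assumption concrete, here is a minimal sketch with a toy distribution (the Gaussian marginal and logistic conditional are assumptions for illustration, not from the slides):

```python
# A minimal sketch of the data-generation model: draw a domain point x from
# the marginal Dx, then draw its label y from the conditional D(y | x).
# Under stationarity, test instances come from the same D as the training data.
import math
import random

def sample_pair():
    x = random.gauss(0.0, 1.0)                 # x ~ Dx (toy marginal)
    p_pos = 1.0 / (1.0 + math.exp(-3.0 * x))   # toy conditional D(y = 1 | x)
    y = 1 if random.random() < p_pos else 0
    return x, y

S = [sample_pair() for _ in range(100)]        # training data S, drawn i.i.d. from D
test_instance, _ = sample_pair()               # a test instance from the same D
```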
Choosing a Hypothesis Space
Supervised Learning: Example
● Alternate: whether there is a suitable alternative restaurant nearby.
● Bar: whether the restaurant has a comfortable bar area to wait in.
● Fri/Sat: true on Fridays and Saturdays.
● Hungry: whether we are hungry right now.
● Patrons: how many people are in the restaurant (values are None, Some, and Full).
● Price: the restaurant’s price range ($, $$, $$$).
● Raining: whether it is raining outside.
● Reservation: whether we made a reservation.
● Type: the kind of restaurant (French, Italian, Thai, or Burger).
● WaitEstimate: the host’s wait estimate: 0–10, 10–30, 30–60, or >60 minutes.
Supervised Learning: Example
[Figure: training data, a set of instances drawn from the instance space (𝑿) and labeled by the unknown target function 𝑓.]
Size of Instance Space (|𝑿|) = 2 × 2 × 2 × 2 × 3 × 3 × 2 × 2 × 4 × 4 = 9216
Size of Hypothesis Space (|𝓗|) of Boolean functions = 2^9216
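As a quick check of the counting above, a short sketch (the attribute-value counts are taken from the restaurant example):

```python
# Instance space size for the restaurant example: the product of the number
# of values each of the 10 attributes can take.
from math import prod

attribute_values = [2, 2, 2, 2, 3, 3, 2, 2, 4, 4]   # Alternate ... WaitEstimate
instance_space_size = prod(attribute_values)
print(instance_space_size)                 # 9216 distinct instances

# Each Boolean-valued hypothesis assigns 0/1 to every instance,
# so |H| = 2 ** 9216, an astronomically large hypothesis space.
num_boolean_hypotheses = 2 ** instance_space_size
print(len(str(num_boolean_hypotheses)))    # number of decimal digits in |H|
```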
Hypothesis Space vs. Hypothesis
There are three different levels of specificity for using the term Hypothesis or Model:
● Hypothesis space: e.g., the family of polynomials.
● Hyperparameter: e.g., polynomial degree = 1 (the sub-family of lines y = a·x + b).
● Parameters: e.g., a = 2, b = 3 (the specific hypothesis y = 2x + 3).
Hypothesis Space vs. Hypothesis
[Figure: the same polynomial example, with hyperparameter degree = 1 and parameters a = 2, b = 3.]
$h^{*} \;=\; \underset{h \in \mathcal{H}}{\arg\max}\; P(h \mid S) \;\equiv\; \underset{h \in \mathcal{H}}{\arg\max}\; P(S \mid h)\,P(h)$
● We can say that the prior probability P(h) is high for a smooth degree-1 or -2 polynomial
and lower for a degree-12 polynomial with large, sharp spikes.
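A minimal sketch of the hyperparameter/parameter distinction in code, using NumPy and toy data that roughly follows y = 2x + 3 (the data and the library choice are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 20)
y = 2.0 * x + 3.0 + rng.normal(0.0, 0.1, size=x.size)   # noisy samples of y = 2x + 3

degree = 1                       # hyperparameter: which sub-family of polynomials we allow
a, b = np.polyfit(x, y, degree)  # parameters of the chosen hypothesis h(x) = a*x + b
print(a, b)                      # close to a = 2, b = 3
```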
Hypothesis Space Selection is Subjective
The observed dataset S alone does not allow us to make conclusions about unseen instances.
We need to make some assumptions!
● These assumptions induce the bias (a.k.a. inductive or learning bias) of a learning
algorithm.
Sample Error
The sample error of hypothesis h with respect to the target function f and data sample S is:
$\mathrm{error}_S(h) \;=\; \frac{1}{|S|}\sum_{x \in S} \mathbb{1}\!\left[\,h(x) \neq f(x)\,\right]$
It is impossible to assess the true error, so we try to estimate it using the sample error.
True Error
The true error of hypothesis h with respect to the target function f and the distribution D is
the probability that h will misclassify an instance drawn at random according to D:
$\mathrm{error}_D(h) \;=\; \Pr_{x \sim D}\!\left[\,h(x) \neq f(x)\,\right]$
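A minimal sketch (the toy target f, hypothesis h, and uniform distribution D are all assumptions for illustration) showing the sample error on S and a large-sample approximation of the true error:

```python
import random

f = lambda x: int(x > 0.5)     # unknown target function (toy)
h = lambda x: int(x > 0.6)     # learned hypothesis (toy)

def error(h, points):
    """Fraction of the given points on which h disagrees with f."""
    return sum(1 for x in points if h(x) != f(x)) / len(points)

S = [random.random() for _ in range(50)]            # small sample drawn from D (uniform on [0, 1])
large = [random.random() for _ in range(1_000_000)] # very large sample from the same D

print(error(h, S))       # sample error: our estimate
print(error(h, large))   # ~ true error, about 0.1 (h and f disagree on 0.5 < x <= 0.6)
```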
Generalization Error
Hypothesis spaces differ in terms of:
● the bias they impose (regardless of the training data set), and
● the model complexity (i.e., how intricate the relationships a model can capture) of a hypothesis space.
○ Can be estimated by the number of parameters of a hypothesis.
Bias: the tendency of a predictive hypothesis to deviate from the expected value when averaged over different training sets.
Variance: the amount of change in the hypothesis due to fluctuation in the training data.
Note-1: Sometimes the term model capacity is used to refer to model complexity and expressiveness together.
Note-2: In general, the required amount of training data depends on the model complexity, the representativeness of the training sample, and the acceptable error margin.
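The variance component can be made visible by refitting the same model on many freshly drawn training sets and watching how much its prediction at a fixed point moves. A minimal sketch (the sinusoidal data and the degree choices are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 20)
x_query = 0.5                      # fixed point at which predictions are compared

def prediction_spread(degree, n_runs=200):
    """Variance of h(x_query) across models fit on different training sets."""
    preds = []
    for _ in range(n_runs):
        y = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, x_train.size)
        coeffs = np.polyfit(x_train, y, degree)
        preds.append(np.polyval(coeffs, x_query))
    return np.var(preds)

print(prediction_spread(1))    # low-complexity hypothesis: small variance (but high bias)
print(prediction_spread(12))   # high-complexity hypothesis: much larger variance
```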
Choosing a Hypothesis Space - II
There is a tradeoff between the expressiveness of a hypothesis space and the computational
complexity of finding a good hypothesis within that space.
● After learning h, computing h(x) when h is a linear function is guaranteed to be fast, while computing an arbitrarily complex function may not even be guaranteed to terminate.
For example:
● In Deep Learning, the learned representations are not simple, but the h(x) computation still takes only a bounded number of steps with appropriate hardware.
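For instance, a learned linear hypothesis can be evaluated with one multiply-add per feature, so its prediction cost is easy to bound (a minimal sketch with illustrative parameter values, not from the slides):

```python
# Evaluating a linear hypothesis h(x) = w . x + b: a fixed, small number of
# arithmetic operations per feature, so prediction is guaranteed to be fast.
def h(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [0.5, -1.2, 3.0], 0.1          # parameters produced by some learner (illustrative)
print(h([1.0, 2.0, 0.5], w, b))       # one prediction in O(len(w)) steps
```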
Bias-Variance vs. Model’s Complexity
The relationship between bias and variance is closely related to the machine learning concepts
of overfitting, underfitting, and model’s complexity.
[Figure: bias and variance as a function of the model’s complexity, with the optimal complexity marked.]
Learning as a Search
Given a hypothesis space, data, and a bias, the problem of learning can be reduced to one of
search.
[Figure: the learner 𝚪: S → h searches the hypothesis space 𝓗 for a final hypothesis or model (h), which then maps a test instance to a prediction.]
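A minimal sketch of learning as search over a tiny, finite hypothesis space of threshold classifiers (the space, the data, and the scoring are assumptions for illustration): the learner 𝚪 simply returns the hypothesis with the lowest sample error on S.

```python
# Learning as search: with a finite hypothesis space H, the learner can
# literally be a search for the hypothesis with the lowest sample error.
def learner(S, H):
    """Return the h in H that misclassifies the fewest training examples."""
    def sample_error(h):
        return sum(1 for x, y in S if h(x) != y) / len(S)
    return min(H, key=sample_error)

# Toy hypothesis space: threshold classifiers on a single numeric feature.
H = [lambda x, t=t: int(x > t) for t in (0.0, 0.5, 1.0, 1.5)]
S = [(0.2, 0), (0.4, 0), (1.2, 1), (1.8, 1)]

h = learner(S, H)
print(h(0.3), h(1.4))   # predictions of the selected hypothesis
```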
Evaluation
Generalizing to Unseen Data
The error on the training set is called the training error (a.k.a. resubstitution error and
in-sample error).
● The training error is not, in general, a good indicator of performance on unseen data. It is often too optimistic.
● Why?
Generalizing to Unseen Data
● The error on the test set is called the test error (a.k.a. out-of-sample error and
extra-sample error).
Given a sample data S, there are methodologies to better approximate the true error of the
model.
Holdout Method
● Split the available data into a training set and a test set.
● Train the model on the training set.
● Test the model (evaluate the predictions) on the test set.
It is essential that the test set is not used in any way to create the model. Don't even look at it!
● 'Cheating' is called leakage.
● 'Cheating' is one cause of overfitting.
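A minimal holdout sketch using scikit-learn with a synthetic dataset (the dataset, model, and split ratio are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 30% of the data; the test split takes no part in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out test set
```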
Holdout Method: Class Exercise
Standardization, as we know, is about scaling the data. It requires calculation of the mean and
standard deviation.
When should the mean and standard deviation be calculated? And Why?
(a) before splitting, on the entire dataset, or
(b) after splitting, on just the training set, or
(c) after splitting, on just the test set, or
(d) after splitting, on the training and test sets separately?
● We are training on only a subset of the available dataset, perhaps as little as 50% of it.
From so little data, we may learn a worse model and so our error measurement may be
pessimistic.
● In practice, we only use the holdout method when we have a very large dataset. The size
of the dataset mitigates the above problems.
K-fold Cross-Validation
● Shuffle the dataset and partition it into k disjoint subsets of equal size.
○ Each of the partitions is called a fold.
○ Typically, k=10, so you have 10 folds.
● You take each fold in turn and use it as the test set, training the learner on the remaining
folds.
● Clearly, you can do this k times, so that each fold gets 'a turn' at being the test set.
○ By this method, each example is used exactly once for testing, and k-1 times for
training.
K-fold Cross-Validation: Pseudocode
● for i = 1 to k:
○ train on D \ Di
○ make predictions for Di
○ measure error (e.g., MAE)
● Report the mean of the errors
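The pseudocode above maps directly onto scikit-learn's KFold; a minimal sketch with a synthetic regression dataset (the dataset and model are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # train on D \ Di
    preds = model.predict(X[test_idx])                           # make predictions for Di
    errors.append(mean_absolute_error(y[test_idx], preds))       # measure error (MAE) on the fold

print(np.mean(errors))   # report the mean of the k fold errors
```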
Facts about K-fold Cross-Validation
○ The number of folds is constrained by the size of the dataset and, sometimes, by statisticians' preference for folds of at least 30 examples.
○ There may still be some variability in the results due to 'lucky'/'unlucky' splits.
Model’s Performance
● Training error high → Underfitting
● Training error low, validation error high → Overfitting
● Training error low, validation error low → Good Model
Evaluation - II & III
Metrics and Loss Functions (DIY)