
IT549: Deep Learning

Lecture 02-03

Choosing a Hypothesis Space


[Inductive Bias, Bias-Variance Trade-off, Model Complexity and Expressiveness Trade-off]

Arpit Rana
3rd / 6th January 2025
Supervised Learning
Problem Settings and Examples
Supervised Learning: A Formal Model

The learner’s input:


● Domain set
An arbitrary set (instance space), X, the set of objects (a.k.a. instances, domain points) we may wish to
label.

● Label set
A set of possible labels, Y. e.g., {0, 1}, {-1, 1}.

● Training data
S = ((x1, y1), . . . , (xm, ym)) is a finite sequence of pairs in X × Y, i.e., a sequence of labeled domain points.

The learner’s output:


● A prediction rule, h : X → Y , also called a predictor, a hypothesis, or a classifier.
○ The learner returns h upon receiving the training sequence S.
○ It can be used to predict the label of new domain points (like the past ones).
Supervised Learning: A Formal Model

Data-generation Model:
● Let D be a probability distribution over X × Y, i.e., D is a joint probability distribution over domain points and labels. It can be viewed as composed of two parts:
○ a distribution Dx over unlabeled domain points (sometimes called the marginal distribution), and
○ a conditional probability over labels for each domain point, D((x, y) | x).

Independent and Identically Distributed (I.I.D.) Assumption


● Each domain point x has the same prior probability distribution (to be sampled):
P(xi) = P(xi+1) = P(xi+2) = · · · ,
and is independent of the previous examples:
P(xi) = P(xi | xi-1 , xi-2 , . . .) .
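
A minimal Python sketch of this data-generation model (the particular marginal Dx and the conditional labeling rule below are hypothetical, chosen only to illustrate i.i.d. sampling from a joint distribution D over X × Y):

import numpy as np

rng = np.random.default_rng(0)

def sample_S(m):
    # Draw m labeled examples i.i.d. from a joint distribution D over X x Y.
    # Marginal Dx over domain points: here a standard normal in 2-D (an assumption).
    X = rng.normal(size=(m, 2))
    # Conditional probability of the label given x: here a noisy linear rule (an assumption).
    p_pos = 1.0 / (1.0 + np.exp(-(X[:, 0] + X[:, 1])))
    y = (rng.random(m) < p_pos).astype(int)
    return X, y

S_X, S_y = sample_S(m=100)   # the training sequence S = ((x1, y1), ..., (xm, ym))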
Supervised Learning: A Formal Model

More formally, the task of supervised learning can be defined as -


Given a training set (S) of m example input-output pairs,

S = ((x(1), y(1)), . . . , (x(m), y(m))),

where each pair was generated by an unknown function y = f(x), discover a function h that approximates the true function f.

We call the output y(i) the ground truth: the true answer we are asking our model to predict.
Supervised Learning Process

(Diagram: in the Training Phase, the Learner (𝚪: S → h) searches the Hypothesis Space 𝓗 and outputs the Final Hypothesis or Model (h); in the Test Phase, h is applied to a test instance to produce a prediction.)

Inductive Learning: given a set of observations, the learner finds a function that is applicable to the entire instance space.

Stationarity: test instances follow the same distribution as the training instances.
Choosing a Hypothesis Space
Supervised Learning: Example

Problem: whether to wait for a table at a restaurant.

● Alternate: whether there is a suitable alternative restaurant nearby.
● Bar: whether the restaurant has a comfortable bar area to wait in.
● Fri/Sat: true on Fridays and Saturdays.
● Hungry: whether we are hungry right now.
● Patrons: how many people are in the restaurant (values are None, Some, and Full).
● Price: the restaurant’s price range ($, $$, $$$).
● Raining: whether it is raining outside.
● Reservation: whether we made a reservation.
● Type: the kind of restaurant (French, Italian, Thai, or Burger).
● WaitEstimate: host’s wait estimate: 0–10, 10–30, 30–60, or >60 minutes.
Supervised Learning: Example

Problem: whether to wait for a table at a restaurant.

(Diagram: the training data is a set of labeled instances drawn from the instance space (𝑿); the labels come from an unknown target function 𝑓.)

Size of the instance space:
|𝑿| = 2 × 2 × 2 × 2 × 3 × 3 × 2 × 2 × 4 × 4 = 9216

Size of the hypothesis space of Boolean functions over 𝑿:
|𝓗| = 2^9216
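
These counts are easy to verify; a small Python check (the value counts per attribute are taken from the example above):

# Alternate, Bar, Fri/Sat, Hungry, Patrons, Price, Raining, Reservation, Type, WaitEstimate
values_per_attribute = [2, 2, 2, 2, 3, 3, 2, 2, 4, 4]

instance_space_size = 1
for v in values_per_attribute:
    instance_space_size *= v
print(instance_space_size)                    # 9216 distinct instances

# A Boolean function assigns 0 or 1 to each of the 9216 instances,
# so there are 2**9216 such functions in the hypothesis space.
print(len(str(2 ** instance_space_size)))     # 2**9216 has 2775 decimal digits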
Hypothesis Space vs. Hypothesis

What do we mean by a Hypothesis Space (a.k.a. Model Class) and a hypothesis?

There are three different levels of specificity for using the term Hypothesis or Model:

● a broad hypothesis space (like “polynomials”),

● a hypothesis space with hyperparameters filled in (like “degree-2 polynomials”), and

● a specific hypothesis with all parameters filled in (like 5x² + 3x − 2).


Hypothesis Space vs. Hypothesis

What do we mean by a Hypothesis Space (a.k.a. Model Class) and a hypothesis?

There are three different levels of specificity for using the term Hypothesis or Model:

(Figure: the model class Polynomials, with hyperparameter degree = 1 and parameters a = 2, b = 3.)
Hypothesis Space vs. Hypothesis

How do we choose a good Hypothesis Space or Model Class?

(Figure: choosing the model class and its hyperparameters (e.g., Polynomials with degree = 1) is Hypothesis Space / Representation / Model Class Selection, popularly known as Model Selection; fitting the parameters (e.g., a = 2, b = 3) is Optimization or Training.)
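
A minimal NumPy sketch of the three levels of specificity, on synthetic data invented purely for illustration: picking "polynomials" fixes the model class, picking the degree fixes the hyperparameter (model selection), and fitting the coefficients produces a specific hypothesis (optimization/training).

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 2 * x + 3 + rng.normal(scale=0.1, size=x.shape)   # hypothetical data near y = 2x + 3

# Level 1: hypothesis space / model class  -> polynomials
# Level 2: hyperparameter filled in        -> degree = 1   (model selection)
degree = 1
# Level 3: parameters filled in            -> a specific hypothesis (training)
coeffs = np.polyfit(x, y, deg=degree)                  # recovers roughly a = 2, b = 3
h = np.poly1d(coeffs)                                  # h(x) = a*x + b
print(coeffs, h(0.5))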
Hypothesis Space Selection is Subjective

Most probable hypothesis given the data:

h* = argmax 𝒉∈𝓗 P(h | S) = argmax 𝒉∈𝓗 P(S | h) · P(h)   (by Bayes' rule, dropping the constant P(S))

● We can say that the prior probability P(h) is high for a smooth degree-1 or -2 polynomial
and lower for a degree-12 polynomial with large, sharp spikes.
Hypothesis Space Selection is Subjective

The observed dataset S alone does not allow us to make conclusions about unseen instances.
We need to make some assumptions!

● These assumptions induce the bias (a.k.a. inductive or learning bias) of a learning
algorithm.

● Two ways to induce bias:


○ Restriction: limit the hypothesis space (e.g., allow only degree-2 polynomials).
○ Preference: impose an ordering on the hypothesis space (e.g., prefer simpler hypotheses over more complex ones); see the sketch below.
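
A minimal sketch contrasting the two ways of inducing bias, assuming NumPy and synthetic data (everything below is illustrative, not part of the lecture's code):

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 40)
y = x**2 + rng.normal(scale=0.1, size=x.shape)

# Restriction bias: limit the hypothesis space to degree-2 polynomials.
h_restricted = np.polyfit(x, y, deg=2)

# Preference bias: allow a rich space (degree-12 polynomials) but prefer
# simpler hypotheses by penalizing large coefficients (an L2 / ridge penalty).
degree, lam = 12, 1e-2
Phi = np.vander(x, degree + 1)          # polynomial feature matrix
h_preferred = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)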
Hypothesis Space Selection is not only subjective but also empirical.

● Part of hypothesis space selection is qualitative and subjective:


We might select polynomials rather than decision trees based on something that we
know about the problem,

and

● part is quantitative and empirical:


Within the class of polynomials, we might select Degree = 2, because that value performs
best on the validation data set.
Experimental Evaluation of Learning Algorithms

The overall objective of the Learning Algorithm is to find a hypothesis that -

● is consistent (i.e., fits the training data), but more importantly,

● generalizes well for previously unseen data.

(Diagram: Hypothesis Space 𝓗 → Learner (𝚪: S → h) → Final Hypothesis or Model (h).)

Experimental Evaluation defines ways to measure the generalizability of a learning algorithm.
Experimental Evaluation of Learning Algorithms

Sample Error

The sample error of hypothesis h with respect to the target function f and the data sample S is the fraction of examples in S that h misclassifies.

True Error

The true error of hypothesis h with respect to the target function f and the distribution D is the probability that h will misclassify an instance drawn at random according to D.

It is impossible to assess the true error directly, so we try to estimate it using the sample error.
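
In standard notation (assuming the 0-1 loss), these two quantities can be written as:

error_S(h) = (1/m) · |{ i ∈ {1, ..., m} : h(xi) ≠ f(xi) }|   (sample error)

error_D(h) = Pr x∼D [ h(x) ≠ f(x) ]   (true error)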
Generalization Error

Generalization error (a.k.a. out-of-sample error) is a measure of how accurately an algorithm is


able to predict outcome values for previously unseen data.

● Variance: due to the model’s sensitivity to small variations in the training data. It leads to overfitting!

● Bias: due to wrong assumptions, i.e., restrictions imposed by
○ the representation (i.e., the hypothesis space, such as linear or quadratic), and
○ the search algorithm (e.g., grid search or beam search).
It leads to underfitting!

● Irreducible Error: due to the noisiness of the data itself. The only way to handle it is to clean up the data properly, and detect and remove outliers.
Choosing a Hypothesis Space - I

One way to analyze hypothesis spaces is by

● the bias they impose (regardless of the training data set), and

● the variance they produce (from one training set to another).


Bias

The tendency of a predictive hypothesis to deviate from the expected value when averaged
over different training sets.

● Bias often results from restrictions imposed by the hypothesis space.

● We say that a hypothesis is underfitting when it fails to find a pattern in the data.
Variance

The amount of change in the hypothesis due to fluctuation in the training data.

● We say a function is overfitting the data when it pays too much attention to the particular data set it is trained on.

● It causes the hypothesis to perform poorly on unseen data.
Bias–Variance Trade-off

● High Variance, High Bias: the model is inconsistent and also inaccurate on average.

● Low Variance, High Bias: models are consistent but inaccurate on average.

● High Variance, Low Bias: models are accurate on average but inconsistent.

● Low Variance, Low Bias: the model is consistent and accurate on average.

(Analogy: throwing darts at a board; bias is how far the darts land from the bullseye on average, variance is how scattered they are.)
Choosing a Hypothesis Space - II

Another way to analyze hypothesis spaces is by

● the expressiveness (i.e., ability of a model to represent a wide variety of functions or


patterns) of a hypothesis space, and
○ Can be measured by the size of the hypothesis space

● the model complexity (i.e., how intricate the relationships a model can capture) of a
hypothesis space.
○ Can be estimated by the number of parameters of a hypothesis

Note-1: Sometimes the term model capacity is used to refer to model complexity and
expressiveness together.
Note-2: In general, the required amount of training data depends on the model complexity,
representativeness of the training sample, and the acceptable error margin.
Choosing a Hypothesis Space - II

There is a tradeoff between the expressiveness of a hypothesis space and the computational
complexity of finding a good hypothesis within that space.

● Fitting a straight line to data is an easy computation; fitting high-degree polynomials is


somewhat harder; and fitting unusual-looking functions may be undecidable.

● After learning h, computing h(x) when h is a linear function is guaranteed to be fast, while computing an arbitrarily complex function may not even be guaranteed to terminate.

For example:
● In Deep Learning, representations are not simple but the h(x) computation still takes
only a bounded number of steps to compute with appropriate hardware.
Bias-Variance vs. Model’s Complexity

The relationship between bias and variance is closely related to the machine learning concepts
of overfitting, underfitting, and model’s complexity.

● Increasing a model’s complexity typically increases its variance and reduces its bias.

● Reducing a model’s complexity increases its bias and reduces its variance.

This is why it is called a tradeoff.

(Figure: bias, variance, and total error plotted against model complexity; the optimal model complexity lies where the total generalization error is minimized.)
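
A minimal sketch of this trade-off on synthetic data (assuming NumPy; the dataset and degrees are illustrative): as the polynomial degree grows, the training error keeps shrinking while the validation error typically falls and then rises again.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)

# Simple split: 40 training points, 20 validation points.
x_tr, y_tr = x[:40], y[:40]
x_va, y_va = x[40:], y[40:]

for degree in (1, 2, 4, 8, 12):
    h = np.poly1d(np.polyfit(x_tr, y_tr, deg=degree))
    train_mse = np.mean((h(x_tr) - y_tr) ** 2)
    val_mse = np.mean((h(x_va) - y_va) ** 2)
    print(degree, round(train_mse, 3), round(val_mse, 3))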
Learning as a Search

Given a hypothesis space, data, and a bias, the problem of learning can be reduced to one of
search.

(Diagram: the Learner (𝚪: S → h) searches the Hypothesis Space 𝓗 for the Final Hypothesis or Model (h), which is then used to make predictions on test instances.)
Evaluation
Generalizing to Unseen Data

The error on the training set is called the training error (a.k.a. resubstitution error and
in-sample error).

● The training error is not, in general, a good indicator of performance on unseen data. It is often too optimistic.

● Why?
Generalizing to Unseen Data

To predict future performance, we need to measure error on an independent dataset:

● We want a dataset that has played no part in creating the model.

● This second dataset is called the test set.

● The error on the test set is called the test error (a.k.a. out-of-sample error and
extra-sample error).

Given a sample data S, there are methodologies to better approximate the true error of the
model.
Holdout Method

● Shuffle the dataset and partition it into two disjoint sets:
○ a training set (e.g., 80% of the full dataset); and
○ a test set (the rest of the full dataset).

● Train the estimator on the training set.

● Test the model (evaluate the predictions) on the test set.

(Diagram: Dataset → Shuffled Dataset → Train | Test.)

It is essential that the test set is not used in any way to create the model. Don't even look at it!
● 'Cheating' is called leakage.
● 'Cheating' is one cause of overfitting.
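
A minimal holdout sketch in NumPy: an 80/20 split of a hypothetical dataset (scikit-learn's train_test_split implements the same idea):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))          # hypothetical features
y = rng.integers(0, 2, size=1000)       # hypothetical binary labels

# Shuffle, then split into 80% training and 20% test.
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
# Fit the model on (X_train, y_train) only; use (X_test, y_test) once, at the very end.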
Holdout Method: Class Exercise

Standardization, as we know, is about scaling the data. It requires calculation of the mean and
standard deviation.

When should the mean and standard deviation be calculated, and why?
(a) before splitting, on the entire dataset, or
(b) after splitting, on just the training set, or
(c) after splitting, on just the test set, or
(d) after splitting, on the training and test sets separately?

What to do when the model is deployed?


Facts about Holdout Method

● The disadvantages of this method are:


○ Results can vary quite a lot across different runs.
○ Informally, you might get lucky — or unlucky
i.e., in any one split, the data used for training or testing might not be
representative.

● We are training on only a subset of the available dataset, perhaps as little as 50% of it.
From so little data, we may learn a worse model and so our error measurement may be
pessimistic.

● In practice, we only use the holdout method when we have a very large dataset. The size
of the dataset mitigates the above problems.

● When we have a smaller dataset, we use a resampling method:


○ The examples get re-used for training and testing.
K-fold Cross-Validation Method

The most-used resampling method is k-fold cross-validation:

● Shuffle the dataset and partition it into k disjoint subsets of equal size.
○ Each of the partitions is called a fold.
○ Typically, k=10, so you have 10 folds.

● You take each fold in turn and use it as the test set, training the learner on the remaining
folds.

● Clearly, you can do this k times, so that each fold gets 'a turn' at being the test set.
○ By this method, each example is used exactly once for testing, and k-1 times for
training.
K-fold Cross-Validation: Pseudocode

● Shuffle the dataset D and partition it into k disjoint equal-sized subsets, D1, ..., Dk.

● for i = 1 to k:
○ train on D \ Di
○ make predictions for Di
○ measure the error (e.g., MAE)

● Report the mean of the k errors.

(Diagram: with k = 5 folds, each fold Di takes one turn as the test set while the remaining folds are used for training.)
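
A minimal runnable version of the pseudocode above, assuming scikit-learn and a linear regressor chosen purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])     # train on D \ Di
    pred = model.predict(X[test_idx])                              # predict for Di
    errors.append(mean_absolute_error(y[test_idx], pred))          # measure error (MAE)

print(np.mean(errors))   # report the mean of the k errors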
Facts about K-fold Cross-Validation

● The disadvantages of this method are:

○ The number of folds is constrained by the size of the dataset and the desire
sometimes on the part of statisticians to have folds of at least 30 examples.

○ It can be costly to train the learning algorithm k times.

○ There may still be some variability in the results due to 'lucky'/'unlucky' splits.

● The extreme is k = n, also known as leave-one-out cross-validation or LOOCV.


Nested K-fold Cross-Validation Method

In case of hyperparameter (i.e., parameters of the model class, not of an individual model) or parameter tuning, we partition the whole dataset into three disjoint sets:

● A training set to train candidate models.

● A validation set (a.k.a. a development set or dev set) to evaluate the candidate models and choose the best one.

● A test set to do a final unbiased evaluation of the best model.

(Diagram: Dataset → Shuffled Dataset → Train | Dev | Test. Train candidate models on the training set and select the best one on the dev set; then merge train and dev, retrain, and test the model; finally merge all of the data, retrain, and deploy the model.)

K-fold Cross-Validation can be applied to the validation set (inner CV) and the test set (outer CV) in a nested way, as sketched below.
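
A minimal nested cross-validation sketch with scikit-learn (the estimator, Ridge, and its hyperparameter grid are assumptions made for illustration):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # inner CV: model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # outer CV: unbiased evaluation

# Inner loop: pick the best hyperparameter (here, the ridge penalty alpha).
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=inner_cv)

# Outer loop: estimate the generalization error of the whole selection procedure.
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean())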
Model’s Performance

(Diagram: a diagnosis flowchart.
● If the Training Error is high → Underfitting.
● If the Training Error is low but the Validation Error is high → Overfitting.
● If the Validation Error is low but the Test Error is high → I.I.D. Violation.
● If the Test Error is also low → Good Model.)
Model’s Performance

(Diagram: the same diagnosis flowchart as above, annotated with remedies for underfitting and overfitting.)

Underfitting (high training error):
● Need a more complex model
● Need less regularization
● Need more features
● More data doesn't work

Overfitting (high validation error):
● Need a simpler model
● Need more regularization
● Remove extra features
● Need more data
Evaluation - II & III
Metrics and Loss Functions (DIY)
