Lecture 02-03
Arpit Rana
3rd / 6th January 2025
Supervised Learning
Problem Settings and Examples
Supervised Learning: A Formal Model
● Label set
A set of possible labels, Y, e.g., {0, 1} or {-1, 1}.
● Training data
S = ((x1, y1), . . . , (xm, ym)) is a finite sequence of pairs in X × Y, i.e., a sequence of labeled domain points.
Data-generation Model:
● Let D be a probability distribution over X × Y, i.e., D is a joint probability distribution over domain points and labels. D can be viewed as composed of two parts:
○ a distribution Dx over unlabeled domain points (sometimes called the marginal distribution), and
○ a conditional probability over labels for each domain point, D((x, y) | x).
[Figure: the training phase produces a final hypothesis or model (h); in the test phase, h maps a test instance to a prediction. Stationarity: the test instance follows the same distribution as the training instances.]
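To make the data-generation model and the stationarity assumption concrete, here is a minimal sketch with a toy distribution (the Gaussian marginal and logistic conditional are assumptions for illustration, not from the slides):

```python
# A minimal sketch of the data-generation model: draw a domain point x from
# the marginal Dx, then draw its label y from the conditional D(y | x).
# Under stationarity, test instances come from the same D as the training data.
import math
import random

def sample_pair():
    x = random.gauss(0.0, 1.0)                 # x ~ Dx (toy marginal)
    p_pos = 1.0 / (1.0 + math.exp(-3.0 * x))   # toy conditional D(y = 1 | x)
    y = 1 if random.random() < p_pos else 0
    return x, y

S = [sample_pair() for _ in range(100)]        # training data S, drawn i.i.d. from D
test_instance, _ = sample_pair()               # a test instance from the same D
```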
Choosing a Hypothesis Space
Supervised Learning: Example
● Alternate: whether there is a suitable alternative restaurant nearby.
● Bar: whether the restaurant has a comfortable bar area to wait in.
● Fri/Sat: true on Fridays and Saturdays.
● Hungry: whether we are hungry right now.
● Patrons: how many people are in the restaurant (values are None, Some, and Full).
● Price: the restaurant’s price range ($, $$, $$$).
● Raining: whether it is raining outside.
● Reservation: whether we made a reservation.
● Type: the kind of restaurant (French, Italian, Thai, or Burger).
● WaitEstimate: the host’s wait estimate: 0–10, 10–30, 30–60, or >60 minutes.
Supervised Learning: Example
[Figure: training data, a set of instances drawn from the instance space (𝑿) and labeled by the unknown target function 𝑓.]
Size of Instance Space (|𝑿|) = 2 × 2 × 2 × 2 × 3 × 3 × 2 × 2 × 4 × 4 = 9216
Size of Hypothesis Space (|𝓗|) of Boolean functions = 2^9216
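As a quick check of the counting above, a short sketch (the attribute-value counts are taken from the restaurant example):

```python
# Instance space size for the restaurant example: the product of the number
# of values each of the 10 attributes can take.
from math import prod

attribute_values = [2, 2, 2, 2, 3, 3, 2, 2, 4, 4]   # Alternate ... WaitEstimate
instance_space_size = prod(attribute_values)
print(instance_space_size)                 # 9216 distinct instances

# Each Boolean-valued hypothesis assigns 0/1 to every instance,
# so |H| = 2 ** 9216, an astronomically large hypothesis space.
num_boolean_hypotheses = 2 ** instance_space_size
print(len(str(num_boolean_hypotheses)))    # number of decimal digits in |H|
```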
Hypothesis Space vs. Hypothesis
There are three different levels of specificity for using the term Hypothesis or Model:
● Hypothesis space: e.g., the family of polynomials.
● Hyperparameter: e.g., polynomial degree = 1 (the sub-family of lines y = a·x + b).
● Parameters: e.g., a = 2, b = 3 (the specific hypothesis y = 2x + 3).
Hypothesis Space vs. Hypothesis
[Figure: the same polynomial example, with hyperparameter degree = 1 and parameters a = 2, b = 3.]
$h^{*} \;=\; \underset{h \in \mathcal{H}}{\arg\max}\; P(h \mid S) \;\equiv\; \underset{h \in \mathcal{H}}{\arg\max}\; P(S \mid h)\,P(h)$
● We can say that the prior probability P(h) is high for a smooth degree-1 or -2 polynomial
and lower for a degree-12 polynomial with large, sharp spikes.
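A minimal sketch of the hyperparameter/parameter distinction in code, using NumPy and toy data that roughly follows y = 2x + 3 (the data and the library choice are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 20)
y = 2.0 * x + 3.0 + rng.normal(0.0, 0.1, size=x.size)   # noisy samples of y = 2x + 3

degree = 1                       # hyperparameter: which sub-family of polynomials we allow
a, b = np.polyfit(x, y, degree)  # parameters of the chosen hypothesis h(x) = a*x + b
print(a, b)                      # close to a = 2, b = 3
```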
Hypothesis Space Selection is Subjective
The observed dataset S alone does not allow us to make conclusions about unseen instances.
We need to make some assumptions!
● These assumptions induce the bias (a.k.a. inductive or learning bias) of a learning
algorithm.
Sample Error
The sample error of hypothesis h with respect to the target function f and data sample S is:
$\mathrm{error}_S(h) \;=\; \frac{1}{|S|}\sum_{x \in S} \mathbb{1}\!\left[\,h(x) \neq f(x)\,\right]$
It is impossible to assess the true error, so we try to estimate it using the sample error.
True Error
The true error of hypothesis h with respect to the target function f and the distribution D is
the probability that h will misclassify an instance drawn at random according to D:
$\mathrm{error}_D(h) \;=\; \Pr_{x \sim D}\!\left[\,h(x) \neq f(x)\,\right]$
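A minimal sketch (the toy target f, hypothesis h, and uniform distribution D are all assumptions for illustration) showing the sample error on S and a large-sample approximation of the true error:

```python
import random

f = lambda x: int(x > 0.5)     # unknown target function (toy)
h = lambda x: int(x > 0.6)     # learned hypothesis (toy)

def error(h, points):
    """Fraction of the given points on which h disagrees with f."""
    return sum(1 for x in points if h(x) != f(x)) / len(points)

S = [random.random() for _ in range(50)]            # small sample drawn from D (uniform on [0, 1])
large = [random.random() for _ in range(1_000_000)] # very large sample from the same D

print(error(h, S))       # sample error: our estimate
print(error(h, large))   # ~ true error, about 0.1 (h and f disagree on 0.5 < x <= 0.6)
```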
Generalization Error
Hypothesis spaces differ in terms of:
● the bias they impose (regardless of the training data set), and
● the model complexity (i.e., how intricate the relationships a model can capture) of a hypothesis space.
○ Can be estimated by the number of parameters of a hypothesis.
Bias: the tendency of a predictive hypothesis to deviate from the expected value when averaged over different training sets.
Variance: the amount of change in the hypothesis due to fluctuation in the training data.
Note-1: Sometimes the term model capacity is used to refer to model complexity and expressiveness together.
Note-2: In general, the required amount of training data depends on the model complexity, the representativeness of the training sample, and the acceptable error margin.
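The variance component can be made visible by refitting the same model on many freshly drawn training sets and watching how much its prediction at a fixed point moves. A minimal sketch (the sinusoidal data and the degree choices are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 20)
x_query = 0.5                      # fixed point at which predictions are compared

def prediction_spread(degree, n_runs=200):
    """Variance of h(x_query) across models fit on different training sets."""
    preds = []
    for _ in range(n_runs):
        y = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, x_train.size)
        coeffs = np.polyfit(x_train, y, degree)
        preds.append(np.polyval(coeffs, x_query))
    return np.var(preds)

print(prediction_spread(1))    # low-complexity hypothesis: small variance (but high bias)
print(prediction_spread(12))   # high-complexity hypothesis: much larger variance
```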
Choosing a Hypothesis Space - II
There is a tradeoff between the expressiveness of a hypothesis space and the computational
complexity of finding a good hypothesis within that space.
● After learning h, computing h(x) when h is a linear function is guaranteed to be fast, while computing an arbitrarily complex function may not even be guaranteed to terminate.
For example:
● In Deep Learning, the learned representations are not simple, but the h(x) computation still takes only a bounded number of steps with appropriate hardware.
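For instance, a learned linear hypothesis can be evaluated with one multiply-add per feature, so its prediction cost is easy to bound (a minimal sketch with illustrative parameter values, not from the slides):

```python
# Evaluating a linear hypothesis h(x) = w . x + b: a fixed, small number of
# arithmetic operations per feature, so prediction is guaranteed to be fast.
def h(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [0.5, -1.2, 3.0], 0.1          # parameters produced by some learner (illustrative)
print(h([1.0, 2.0, 0.5], w, b))       # one prediction in O(len(w)) steps
```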
Bias-Variance vs. Model’s Complexity
The relationship between bias and variance is closely related to the machine learning concepts
of overfitting, underfitting, and model’s complexity.
[Figure: bias and variance as a function of the model’s complexity, with the optimal complexity marked.]
Learning as a Search
Given a hypothesis space, data, and a bias, the problem of learning can be reduced to one of
search.
[Figure: the learner 𝚪: S → h searches the hypothesis space 𝓗 for a final hypothesis or model (h), which then maps a test instance to a prediction.]
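A minimal sketch of learning as search over a tiny, finite hypothesis space of threshold classifiers (the space, the data, and the scoring are assumptions for illustration): the learner 𝚪 simply returns the hypothesis with the lowest sample error on S.

```python
# Learning as search: with a finite hypothesis space H, the learner can
# literally be a search for the hypothesis with the lowest sample error.
def learner(S, H):
    """Return the h in H that misclassifies the fewest training examples."""
    def sample_error(h):
        return sum(1 for x, y in S if h(x) != y) / len(S)
    return min(H, key=sample_error)

# Toy hypothesis space: threshold classifiers on a single numeric feature.
H = [lambda x, t=t: int(x > t) for t in (0.0, 0.5, 1.0, 1.5)]
S = [(0.2, 0), (0.4, 0), (1.2, 1), (1.8, 1)]

h = learner(S, H)
print(h(0.3), h(1.4))   # predictions of the selected hypothesis
```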
Evaluation
Generalizing to Unseen Data
The error on the training set is called the training error (a.k.a. resubstitution error and
in-sample error).
● The training error is not, in general, a good indicator of performance on unseen data. It is often too optimistic.
● Why?
Generalizing to Unseen Data
● The error on the test set is called the test error (a.k.a. out-of-sample error and
extra-sample error).
Given a sample data S, there are methodologies to better approximate the true error of the
model.
Holdout Method
● Split the available data into a training set and a test set.
● Train the model on the training set.
● Test the model (evaluate the predictions) on the test set.
It is essential that the test set is not used in any way to create the model. Don't even look at it!
● 'Cheating' is called leakage.
● 'Cheating' is one cause of overfitting.
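A minimal holdout sketch using scikit-learn with a synthetic dataset (the dataset, model, and split ratio are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 30% of the data; the test split takes no part in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out test set
```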
Holdout Method: Class Exercise
Standardization, as we know, is about scaling the data. It requires calculation of the mean and
standard deviation.
When should the mean and standard deviation be calculated? And Why?
(a) before splitting, on the entire dataset, or
(b) after splitting, on just the training set, or
(c) after splitting, on just the test set, or
(d) after splitting, on the training and test sets separately?
● We are training on only a subset of the available dataset, perhaps as little as 50% of it.
From so little data, we may learn a worse model and so our error measurement may be
pessimistic.
● In practice, we only use the holdout method when we have a very large dataset. The size
of the dataset mitigates the above problems.
K-fold Cross-Validation
● Shuffle the dataset and partition it into k disjoint subsets of equal size.
○ Each of the partitions is called a fold.
○ Typically, k=10, so you have 10 folds.
● You take each fold in turn and use it as the test set, training the learner on the remaining
folds.
● Clearly, you can do this k times, so that each fold gets 'a turn' at being the test set.
○ By this method, each example is used exactly once for testing, and k-1 times for
training.
K-fold Cross-Validation: Pseudocode
● for i = 1 to k:
○ train on D \ Di
○ make predictions for Di
○ measure error (e.g., MAE)
● Report the mean of the errors
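The pseudocode above maps directly onto scikit-learn's KFold; a minimal sketch with a synthetic regression dataset (the dataset and model are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # train on D \ Di
    preds = model.predict(X[test_idx])                           # make predictions for Di
    errors.append(mean_absolute_error(y[test_idx], preds))       # measure error (MAE) on the fold

print(np.mean(errors))   # report the mean of the k fold errors
```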
Facts about K-fold Cross-Validation
○ The number of folds is constrained by the size of the dataset and, sometimes, by statisticians' preference for folds of at least 30 examples.
○ There may still be some variability in the results due to 'lucky'/'unlucky' splits.
Model’s Performance
● Training error high → Underfitting
● Training error low, validation error high → Overfitting
● Training error low, validation error low → Good Model
Evaluation - II & III
Metrics and Loss Functions (DIY)